ABSTRACT

A computer based reliability projection aid, tailored specif- 
ically for application in the design of fault-tolerant computer systems, 
is described. Its more pronounced characteristics include the facility 
for modeling systems with two distinct operational modes, measuring the 
effect of both permanent and transient faults, and calculating
conditional system coverage factors. The underlying conceptual
principles, mathematical models and computer program implementation
are presented in considerable detail.

PART I

The task of predicting computer reliability, and in particular
that of fault tolerant configurations, is unfortunately a rather
complex one that, of necessity, is performed with ever increasing
frequency. The complexity arises primarily as a consequence of the
rather large set of interactive factors that must be contended with
if the prediction is to be meaningful. The quickening pace is the
result of both increased application as, for example, in long life
satellites and all digital flight control systems, and the general
recognition that a multitude of reliability enhancement techniques
are available and that the proper subset for a given application is
best selected on a scientific rather than intuitive basis.

In recent years, a variety of computerized reliability models 
have been developed in order to minimize both the difficulty and the 
attendant inaccuracy long associated with such prediction. One or 
another of these gives proper recognition to such individual factors 
as coverage, multiple system operating modes, transient versus perm- 
anent failure, and so on. What is required, however, and not so 
readily available, is a unified model set incorporating the sum total 
of these, as well as a means of quantifying, rather than simply
measuring the impact of, the most elusive of these, coverage. (This
latter term represents the fact that the measure of reliability 
transcends hardware availability, and must in addition account for 
such issues as fault recognition, reconfiguration delay and applica- 
tion imposed time and accuracy constraints.) 

The purpose of this report, then, is to describe the development
and software implementation of a new series of mathematical
models which are specifically designed to contend with these items.
It thus portrays a reliability assessment system that provides a
unique capacity both for calculating coverage and for measuring
reliability as it relates to the totality of these factors. As such,
hopefully, the system will serve as a moderate step forward in the
evolution of a scientific approach to fault tolerant computer
analysis.

.1 OVERVIEW

The computer program itself, designated CARE2, is designed to
run on a 60 bit CDC 6000 series computer system, using the RUN
Fortran compiler, under either the KRONOS 2.1 or SCOPE 3.0 operating
system. It is available in two versions, the first of which requires
a field length of approximately 130K octal to load and execute and
the second, approximately 100K. The versions differ only with
regard to their ability to provide output data in the form of
reliability plots.

The program is based, in part, on an earlier version entitled
CARE*. The latter is somewhat characteristic of its generation in
that, although it calculates computer system reliability for a
variety of possible configurations, it does not account for the
effects of transient faults, multiple system operating modes and
most aspects of coverage.

The current program, with regard to its modelling capability,
is comprised of three major and relatively independent portions.
The first of these, designated "equations 1-6", is the collection
of specialized reliability models included in the original CARE. The
second, referred to as either "equation 7" or the "dual mode model",
is unique to CARE2, and provides the capability for assessing system
reliability as it relates to the set of aforementioned factors. The
third, known as the "coverage model", enables the calculation of
coverage under a variety of conditions, and thus serves as
input for the dual mode model and, where applicable, equations 1-6.


* F.P. Mathur, "CARE Program (Computer-Aided Reliability Estimation)",
Technical Memo No. 361-10, Jet Propulsion Laboratory, Pasadena,
California, February 24, 1971


RAYTHEON COMPANY
EQUIPMENT DIVISION



CARE2, like its predecessor, portrays a computer system as a
series of one or more independent subsystems or stages, wherein
system reliability, by definition, is equal to the product of stage
reliabilities. In turn, a stage is comprised of a quantity of
identical units, termed LRU's, a portion of which serve as active or
on-line devices and the remainder as standby spares. Each stage is
then represented by a single equation (i.e., model), selected on the
basis of its ability to adequately represent the required physical
configuration.
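
The product rule for independent stages can be illustrated with a
minimal sketch. The function name and the numeric values below are
illustrative assumptions and are not taken from CARE2 itself.

```python
import math


def system_reliability(stage_reliabilities):
    """Return the product of independent stage reliabilities.

    Each entry is the probability (0 to 1) that one stage is still
    operational at the mission time of interest.
    """
    for r in stage_reliabilities:
        if not 0.0 <= r <= 1.0:
            raise ValueError("stage reliability must lie in [0, 1]")
    return math.prod(stage_reliabilities)


# Hypothetical four-stage system: input, CPU, memory, output.
example = system_reliability([0.99, 0.95, 0.98, 0.99])
```

Because the stages are treated as independent, a single weak stage
bounds the reliability of the whole system from above.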

By way of illustration, a typical computer system might be 
viewed as consisting of four stages representing, respectively, the
input, central processing, memory and output functions. In turn, the 
input stage, for example, might itself consist of three units, whose 
outputs are majority voted, and a single standby spare. Given then 
certain simplifying assumptions, the stage can be characterized by 
equation 4, which represents a "hybrid/simplex" configuration as 
defined by Mathur*. 

In addition, however, the current program also provides the 
means for analyzing a computer system (or portion thereof) in which 
the component stages are interdependent such that total reliability 
is not a simple product of stage reliabilities. This situation is 
typified by a multi-stage computer wherein, given insufficient LRU's 
to constitute a fully operational system, a portion of those that 
remain are switched off and a backup or degenerate mode of operation 
is entered (cf. Figure 2-1). In this instance, the failure of an 
LRU in one stage has a conditional effect on the quantity of on-line 
versus spare LRU's in a second and thus, given unequal failure rates 
in the two states, a reliability impact on the second stage. 

This capability, as well as certain others, is unique to the
dual mode model (equation 7). As a consequence of this, and the
additional fact that the majority of equation 1-6 configurations
are but specialized subsets of equation 7, the majority of systems
can, in practice, be modelled without reliance on the original six.
The following summary discussion thus centers solely on the dual
mode and coverage models which, together, constitute both a complete
evaluation system and the new modelling portion of CARE2.

* op. cit.

.2 DUAL MODE MODEL 

This model, which is appended to the original CARE program 
as the seventh of seven equations, to a large degree supplants the 
preceding six, and thus serves as the focal point for the majority 
of reliability evaluations performed within CARE2. The mathematical 
model which underlies it, though beyond the scope of this summary, 
is presented in detail in Section 3.1. The capabilities which the
model provides, from the perspective of the user, are highlighted in 
the following paragraphs. 

As noted previously, CARE2 represents a computer system as a 
series of stages, each corresponding to a single class of functional 
LRU's. The dual mode model, in terms of its software implementation, 
has a capacity for up to eight such stages, wherein each represents 
a grouping of like LRU's consisting of zero or more each of on-line 
and spare units. 

The phrase dual mode is, itself, a reference to an ability to
model a system comprised of two distinct quantities of on-line LRU's
per stage, which correspond, respectively, to the number required
for operation in each of two system modes. Thus, the object computer
system is represented by two separately specified configurations
which, in turn, characterize its hardware complement in both an
initial and backup state.

This dual mode provision enables the representation of computer
systems which, confronted with critical equipment failures, are
designed to reconfigure into a second, and less demanding, state. For
example, in Figure 2-1, mode 1 is the preferred operating state as
a consequence of its provision for rapidly discerning faults via the
comparator. Given, however, a permanent failure in either the
comparator or all but one of the LRU's of a particular stage,
operation can no longer continue in that state. At this point, the
object computer system is designed to reconfigure into its mode 2
state wherein sufficient hardware, barring further critical failure,
is available to maintain a "degenerate" form of operation.

In addition to stages and their component LRU's, the model
also provides for the representation of two object system "single
point failure mechanisms." The first of these corresponds to the
class of switch and comparator/voter failures which, on occurrence,
unconditionally cause degeneration from mode 1 to mode 2. The
second represents the class of failures, termed catastrophic, which
cause immediate and total loss of the system. Each is expressed,
within the model, solely in terms of its corresponding permanent
failure rate.

LRU's, on the other hand, are considered subject to both tran- 
sient and permanent faults, wherein the rate of the latter is, in 
addition, conditioned by LRU status. As a consequence, the model 
provides for a separate LRU failure rate specification for each of 
three failure classifications, i.e., transient, on-line permanent
and standby permanent.

Given the occurrence of mode degeneration in a computer system,
it is typically accompanied by a partial equipment shutdown in which 
one or more fully functional LRU's are released from active service. 
These units then, as a function of computer design specifics, either 
re-enter the spares pool or, conversely, are not re-assignable and 
thus, from a computer reliability viewpoint, are purged from the 
system. The model provides for either possibility, in accordance 
with a user controlled software switch setting. 
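
The two reassignment possibilities can be sketched as follows. The
function and parameter names are hypothetical, and the sketch assumes
that the number of units released equals the working on-line units in
excess of the mode 2 requirement; the actual CARE2 switch and its
bookkeeping may differ.

```python
def spares_after_degeneration(working_online, mode2_required,
                              spares, reassign_released):
    """Return the spare count of one stage after mode degeneration.

    working_online    -- functional on-line LRU's at degeneration time
    mode2_required    -- on-line LRU's needed for mode 2 operation
    spares            -- spares available before degeneration
    reassign_released -- True if released units re-enter the spares
                         pool, False if they are purged from the system
    """
    if working_online < mode2_required:
        raise ValueError("insufficient hardware for mode 2 operation")
    released = working_online - mode2_required
    return spares + released if reassign_released else spares
```

For example, a stage with three working on-line units, a mode 2
requirement of one and two spares ends with four spares when released
units are reassignable, and with two when they are not.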

In total then, the dual mode model has the capacity for
representing a computer system with either one or two distinct
hardware configurations, and consisting of one to eight functional
stages. In practice, the system is defined in terms of these stages,
wherein the user specifies, for each stage, the mode 1, mode 2 and
spare LRU count, as well as the applicable transient, on-line and
standby failure rates. Additionally, the two single point failure
rates and the spares reassignment method (on degeneration) are user
specified at the system level.
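
The user-specified quantities listed above can be collected in a small
data structure. This is only a sketch: the class and field names are
hypothetical and do not correspond to the actual CARE2 input format,
which is defined in the users guide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageSpec:
    """Per-stage inputs (field names are illustrative assumptions)."""
    mode1_online: int              # on-line LRU's required in mode 1
    mode2_online: int              # on-line LRU's required in mode 2
    spares: int                    # standby spare LRU's
    transient_rate: float          # transient failure rate
    online_permanent_rate: float   # permanent rate, on-line LRU's
    standby_permanent_rate: float  # permanent rate, standby LRU's


@dataclass
class DualModeSystemSpec:
    """System-level inputs (field names are illustrative assumptions)."""
    stages: List[StageSpec]
    mode_degeneration_rate: float  # single point failures forcing mode 2
    catastrophic_rate: float       # single point failures losing the system
    reassign_released_units: bool  # spares reassignment switch

    def validate(self):
        # The text states a capacity of one to eight stages.
        if not 1 <= len(self.stages) <= 8:
            raise ValueError("the model accepts one to eight stages")
        for s in self.stages:
            counts = (s.mode1_online, s.mode2_online, s.spares)
            rates = (s.transient_rate, s.online_permanent_rate,
                     s.standby_permanent_rate)
            if min(counts) < 0 or min(rates) < 0.0:
                raise ValueError("counts and rates must be non-negative")
        if min(self.mode_degeneration_rate, self.catastrophic_rate) < 0.0:
            raise ValueError("system level rates must be non-negative")
```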

.3 COVERAGE MODEL 

For purposes of this study, coverage is defined as "the condi- 
tional probability, given a fault and sufficient hardware replacements 
to remedy it, that the system is returned to operational status in 
a form consistent with external time and accuracy needs." In other 
words, it is a measure of the reduction in system reliability due 
to issues other than those of hardware availability, and thus serves 
as an accounting for such factors as the inability to detect,
isolate and/or recover from certain classes of faults. For example,
given both a permanent fault in an on-line memory and a fully
operational spare, the conditions of hardware success are, for the
moment, met. Those of coverage, however, further dictate that the
fault be correctly attributed to the memory, and that the replacement
unit then be "reloaded" within certain time constraints.

It follows that coverage is not single valued for a given 
computer system. For instance, the ability to detect a fault is 
clearly a function of its location within that system. Further, 
the ability to recover, as from a memory fault, may depend on the 
presence of a second unit with identical storage contents, and this 
may be available in only one of the mode configurations. 

From a system perspective then, coverage is readily seen to
depend on fault subclass (location, as within an LRU), fault type
(permanent or transient) and operating configuration (mode). In
addition, it is related to at least two other factors, i.e., whether
the fault results in a mode change (which may entail certain risk),
and the status of spare LRU's (which are subject to prior failure,
thus presenting a possible requirement for trial selection and
testing in order to locate an operational unit).

The model provides for the calculation of a single conditional
coverage value for each valid combination of these conditions. In
turn, the set of computed values is then supplied to the appropriate
reliability equation for inclusion in the system reliability
assessment. Since, in practice, the original six equations are, at
best, ill equipped to contend with coverage, the normal recipient is
the dual mode model.

Since the same methodology is utilized to compute each of the 
conditional coverage values, the remainder of this discussion is 
addressed to the rationale underlying the modelling and calculation 
of a single value (i.e., for a fixed set of conditions). It should 
be specifically noted, however, that each of the parameters refer- 
enced therein has a potentially unique value for each set of condi- 
tions. 

Given the occurrence of a fault, the object system is presumed
to be equipped with more or less suitable defense mechanisms to
contend with it. The coverage model is based on the assumption that,
in all but the unprotected case, these consist of one or more fault
detectors (e.g., memory parity and software selftest) as well as
associated fault isolation and recovery techniques. In this sense,
conditional coverage is but an overall performance measure of these,
as a group, and as they pertain solely to the fixed conditions.

In addition, it is assumed that, given multiple fault detectors
(within the fixed conditions set), they are best represented
as statistically independent processes. (Conversely, members of
separate sets are treated as mutually exclusive.) As such, on
occurrence of a fault, they can be viewed as competing with one
another, though not necessarily with equal vigor. The "winner", if
in fact there is one, then calls into play, in sequence, its
associated isolation and recovery processes.

Given the detection of a fault in the object computer system,
the ability to identify the responsible detector can also be
instrumental in constraining the isolation and recovery tasks which
follow (i.e., individual detectors have more or less unique properties
with respect to the type of faults they detect and when they detect
them). As a consequence, the coverage model is designed to provide for
the calculation of the conditional detection probability associated
with each of the competing detectors, given its presence in a
competitive situation.

In turn, these values are highly dependent on the following
factors:

• The stand-alone success probability of each fault detector
(i.e., the percentage of faults it would discern if it
were the only detector present).

• The corresponding stand-alone detection rate associated
with each.

• The complete time base or schedular interrelationship
between each (i.e., individual repetition rates, where
applicable, zero-offset delays, etc.).

Given this data as input, the model performs the calculations 
necessary (cf. Section 3.2) in order to establish the interrelation- 
ship that exists between the detectors, and thus provide the condi- 
tional detection probability and rate of each, when in competition 
with the others.
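
The notion of competing detectors can be illustrated with a simplified
sketch. It is not the Section 3.2 method: it ignores the schedular
factors (repetition rates, offsets) and instead assumes that each
detector, if it detects the fault at all, does so after an
exponentially distributed delay at its stand-alone rate. The function
name and this exponential assumption are introduced here for
illustration only.

```python
from itertools import combinations


def conditional_detection_probabilities(success_probs, rates):
    """Probability that each detector is the first to detect a fault.

    success_probs[i] -- stand-alone success probability of detector i
    rates[i]         -- stand-alone detection rate of detector i (> 0)

    Assumes independent detectors with exponential detection delays
    (an illustrative simplification, not the CARE2 formulation).
    """
    n = len(success_probs)
    results = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        # Condition on which of the other detectors would also succeed.
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                weight = 1.0
                for j in others:
                    p = success_probs[j]
                    weight *= p if j in subset else 1.0 - p
                # Exponential race: detector i wins in proportion
                # to its share of the total rate.
                race = rates[i] / (rates[i] + sum(rates[j] for j in subset))
                total += weight * race
        results.append(success_probs[i] * total)
    return results
```

Under these assumptions the individual values sum to the probability
that at least one detector succeeds, so the shortfall from unity is
the probability that the fault goes undetected.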

This information is then utilized, in conjunction with corres- 
ponding fault isolation and fault recovery time profiles, in order 
to produce the contribution to coverage (for the fixed condition set) 
associated with each of the individual detectors. Thus, their sum 
is the conditional coverage value associated with the set of condi- 
tions. 
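
The summation over detectors can be sketched in a few lines. As a
deliberate simplification, the isolation and recovery time profiles
are collapsed here into single probabilities per detector; the names
and the tuple layout are assumptions made for illustration.

```python
def conditional_coverage(detectors):
    """Sum the coverage contributions of the competing detectors.

    detectors -- iterable of (p_detect, p_isolate, p_recover) tuples:
      p_detect  -- conditional probability this detector is first
      p_isolate -- probability of correct isolation, given detection
      p_recover -- probability of recovery, given correct isolation
    Time-dependent profiles are replaced by scalars in this sketch.
    """
    return sum(pd * pi * pr for pd, pi, pr in detectors)


# Two hypothetical detectors for one fixed condition set.
coverage = conditional_coverage([(0.25, 0.9, 0.8), (0.75, 1.0, 0.9)])
```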

In this regard, the isolation profile is a rate (i.e., probability
density) function which represents the performance characteristics
of the isolation process corresponding to the particular
detector. The referenced recovery profile is, in fact, two separate
characterizations, both measured in terms of probability (rather than
probability density) versus time. The first of these, referred to
as the error propagation function, is indicative of the reduction
in recovery probability due to the potentially unconstrained
propagation of erroneous data, throughout the computer system, during
the interval between fault occurrence and fault detection. The second,
known as the time delay function, corresponds to the potential
diminishment of recovery probability as a function of effective
computer down time, as measured from fault occurrence on through
detection, isolation and, where required, trial selection and testing
of spares, until such time as the true recovery process is initiated.
(The time delay impact associated with the recovery process itself is
handled implicitly in the recovery probability function.)

Again, it should be emphasized that the above discussion
describes the coverage calculation process associated with a single 
subclass location, while operating in a particular mode, and when 
confronted by a specific type of fault. Also, it should be noted 
that each of the fault detectors has a unique description, both 
within and between condition sets, and further that the isolation and 
two recovery processes also have unique descriptions for each possi- 
ble combination of detectors and condition sets. 

In perspective, then, the repetition of this process, once for
each of the possible condition sets, constitutes the process
referred to as coverage calculation.

PART II
SECTION 1
INTRODUCTION


The purpose of this combined study and software development
effort was to provide an advanced analytical tool for use in estimating
the reliability of various fault-tolerant, dual, spare-switching
digital computer systems. As such, it was to include a means of
accounting not only for permanent (i.e., hard) faults, but also for
transient faults (i.e., faults which do not persist, but whose effects
can) and coverage factor (i.e., the conditional probability of system
recovery given that sufficient operational hardware is available).

As originally conceived, the phrase "dual system" was in reference 
to a configuration comprised of two identical computers, operating in 
parallel, and joined following their digital output stage by a compar- 
ator. Further, each computer was partitioned into four switchably 
replaceable units, and these in turn were backed up, at the system 
level, by a variable quantity of spares. 

In the course of developing the accompanying mathematical models,
an opportunity for increasing the problem solving capability rather
significantly became apparent, but with it, a practical necessity that
it be incorporated prior to any further software development. The
resulting change, subsequently agreed to by both parties, amounted to
substitution of the phrase "dual mode" for "dual system" in the
original work statement. With it, however, came a revision of the
mathematical models and software so as to include a capability for
evaluating virtually any configuration involving sparing at the unit
level, and accompanied by any kind of fault detection, isolation and
recovery procedures, provided that only two system operating modes are
required and that software based numeric limits are not exceeded. The
resultant modeling capability thereby extends, for example, to the
majority of both "N-modular-redundant" and "hybrid" systems.

The software phase of the effort was specified to be implemented
as an addition to an existing statistical reliability prediction
program entitled CARE*. This program was developed by the Jet
Propulsion Laboratory to "serve as a computer aided reliability design
tool to designers of ultra-reliable fault-tolerant systems, by
facilitating reliability computation, data generation, and comparative
evaluation".

CARE's rather singular emphasis on permanent faults and one mode
system operation was seen, however, as a rather significant shortcom- 
ing with respect to the defined programmatic requirements, and thus 
indicative of the need either for a serious revision or a nearly 
stand-alone addition to the existing base. 

As a practical matter the latter course was selected, and with 
it, the corollary development of a dichotomy between old and new
halves of the program. This division, though most pronounced in the
inner workings of the software, also evidences itself in both the 
input/output structure and the frequent overlapping of capabilities 
wherein a given system can be evaluated with either half, though with 
rather marked differences in terms of modeling depth. 

The new portion of this program, at times supplemented by the 
old, can be used as a reasonably sophisticated analytical tool to 
assess the reliability of a wide variety of fault-tolerant computer 
configurations. In addition, it is particularly attuned to the needs
of performing sensitivity analyses on variations of a single
configuration including, for example, measuring the effect of changes
in schedule on the usefulness of a software self-test routine.


* F.P. Mathur, "CARE Program (Computer-Aided Reliability Estimation)", 
Technical Memo No. 361-10, Jet Propulsion Laboratory, Pasadena,
California, February 24, 1971

It should be specifically noted, however, that the necessary
inputs to the program include a statistical specification of all 
significant fault detection, isolation, and recovery mechanisms 
within the system, i.e., it must be furnished data describing the 
properties of the various coverage features included in the object 
computer. The generation of this data is, in some cases, quite 
difficult, and the values obtained, often subjective. This is seen, 
however, not as a shortcoming of the study, but rather as an indica- 
tion that reliability prediction is, itself, a difficult undertaking, 
and that much additional work remains before it can truly become a 
science. 

The following sections describe the results of the subject 
effort, including development of the conceptual base, its implemen- 
tation in terms of mathematical models and subsequently software, 
and instructions on its utilization in the form of a "users guide" 
for operation of the program.

SECTION 2

CONCEPTUAL DEVELOPMENT 

The achievement of fault tolerance in a reconfigurable computer 
system transcends the issue of hardware survivability; a second and 
equally essential prerequisite is that of fault awareness and the 
consequent ability to re-establish control successfully in a properly 
reconfigured system. This necessitates that the system include a 
methodology for detecting, isolating, and ultimately recovering from 
any of a necessarily large percentage of faults. Collectively, these 
methods are referred to herein as "fault tolerant features" (FTF's), 
and their integrated function is to provide "coverage". 

At issue then, in a properly directed system design process, is 
the selection of an effective FTF set for a particular configuration, 
with the limiting constraint being that each provides a degree of 
coverage only at the expense of additional hardware and/or reduction 
in effective computational throughput. Up to a point therefore, the 
addition of FTF's enhances the probability of successful operation; 
beyond that, the ever decreasing gain in coverage (via overlap of 
faults covered) is quickly overshadowed by the failure mechanisms 
inherent in the FTF implementation hardware. 

This study, which is itself the outgrowth of a current necessity
for quantifying fault tolerant computer performance, has, therefore,
as its fundamental objective, the development of a computer based
method for assessing system reliability as it relates to both hardware
configuration and the FTF's employed therein.

Of the three main tasks in this effort (deriving the mathematical
reliability model for a dual mode system, providing for calculation
of the coverage factor, and extending the existing CARE program
accordingly), the second is undoubtedly the most complex, and
necessarily the most open to question. Consequently, much of the
following discussion centers on this particular aspect of the study.

2.1 GROUNDRULES AND ASSUMPTIONS 

In order to place bounds on both the magnitude of the study and 
the execution time required for the resulting computer program, certain 
simplifying assumptions must necessarily be made. At the same time, 
however, it is of even greater importance that all governing perform- 
ance factors be properly accounted for. What is called for, then, is 
a somewhat delicate balance between too much and too little.
Unfortunately, as is often the case when developing models, the
quantitative information most desired for the decision making process
is available only as an output from, rather than input to, the modeling
task. As a consequence, therefore, it is necessary to rely rather
heavily on intuitive processes rather than known phenomena.

In this study, the following considerations were judged to be both 
necessary and sufficient in order to provide for a representative re- 
liability assessment: 

2.1.1 Fault Simultaneity 

Independent and simultaneous, or near simultaneous, fault
occurrences are assumed to be sufficiently improbable as to justify
ignoring them, both for the purpose of system evaluation and for
that of system design as well.

2.1.2 Failure Rate

Constant values are assumed throughout. In particular, three
separately specified quantities are provided for each of a quantity
of LRU (line replaceable unit) types:

• λ - the permanent-failure rate of on-line (i.e., active) LRU's

• μ - the permanent-failure rate of standby (i.e., spare) LRU's

• 7^ - the transient-failure rate of on-line LRU's.

In addition, two single-valued system level rates are included:

• ^2 - the permanent-failure rate of system components
whose loss forces an immediate degeneration in
operating mode (e.g., certain non-catastrophic
failures in an output voter or comparator)

• ^3 - the permanent-failure rate of system components
whose loss is catastrophic (e.g., loss of all
input power to the system).
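
A constant failure rate implies an exponentially distributed time to
failure, so the probability that a given failure mechanism has not
yet occurred by time t is exp(-rate x t). The sketch below illustrates
this relationship only; the function name, the time unit (hours) and
the numeric rates are illustrative assumptions, not CARE2 values.

```python
import math


def survival_probability(rate_per_hour, hours):
    """Probability of no failure by `hours` at a constant failure rate."""
    if rate_per_hour < 0.0 or hours < 0.0:
        raise ValueError("rate and time must be non-negative")
    return math.exp(-rate_per_hour * hours)


# Hypothetical rates for one LRU type over a 1000 hour mission.
online_ok = survival_probability(1.0e-4, 1000.0)   # on-line permanent
standby_ok = survival_probability(1.0e-5, 1000.0)  # standby permanent

# Independent single point mechanisms combine by adding their rates.
single_point_ok = survival_probability(1.0e-6 + 1.0e-7, 1000.0)
```

With a lower standby rate, a spare is more likely than an on-line unit
to survive the same interval, which is why the on-line and standby
rates are specified separately.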

2.1.3 Coverage 

For purposes of this study, coverage is defined as "the condit- 
ional probability, given a fault and sufficient hardware replacements 
to remedy it, that the system is returned to operational status in a 
form consistent with external time and accuracy needs". In turn, 
its value is viewed as functionally dependent on three conditional 
probabilities: 

• The probability, given a fault, that it is detected

• The probability, given a detected fault, that it is
properly isolated

• The probability, given a properly isolated fault and
sufficient spares to remedy it, that a satisfactory
cure is effected.
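
One way to make this dependence concrete, assuming (as a simplification
not stated in the original) that the three probabilities are time
independent and combine multiplicatively, is sketched below. The
symbols C, P_D, P_I and P_R are introduced here purely for
illustration.

```latex
% C   : coverage for one fixed set of conditions
% P_D : probability of detection, given a fault
% P_I : probability of proper isolation, given detection
% P_R : probability of satisfactory recovery, given isolation
%       and sufficient spares
C = P_D \, P_I \, P_R
```

For instance, illustrative values of P_D = 0.99, P_I = 0.98 and
P_R = 0.95 would give a coverage of approximately 0.922.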

2.1.4 Coverage Conditions 

Though often discussed as though it were a single valued
quantity for a given system operating in a given environment, coverage
is more accurately a set of values, each of which corresponds to a
specific situation. In particular, the degree of coverage offered is,
at a minimum, dependent on the following conditions:

• operating mode - whether the system is fully operational
(e.g., 2 out of 2 channels functioning) or
degenerate (e.g., 1 out of 2 channels functioning)

• fault location - whether the fault lies, for example,
in the system's input, CPU, memory bit plane, memory
word electronics or output area

• fault type - whether the fault is permanent or transient

• spares status - whether initial replacement units, as
selected from the spares pool to remedy a fault, are
themselves operational or faulty

• degeneration requirements - whether or not it is
necessary, as a function of spares status, to change
operating mode (i.e., to undergo transition) in the
event of a fault.

As a consequence, a unique coverage value exists for each
combination of conditions, and must therefore be accounted for in the
reliability assessment.
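
The size of the resulting set of coverage values can be illustrated by
enumerating the combinations. The sketch below uses the example values
from the list above and, as a simplifying assumption, treats every
combination as valid; in practice some combinations may not apply and
would be filtered out. All names are hypothetical.

```python
from itertools import product

# Example condition values taken from the list above.
OPERATING_MODES = ("fully operational", "degenerate")
FAULT_LOCATIONS = ("input", "CPU", "memory bit plane",
                   "memory word electronics", "output")
FAULT_TYPES = ("permanent", "transient")
SPARES_STATUS = ("operational", "faulty")
DEGENERATION_REQUIRED = (False, True)


def enumerate_condition_sets():
    """Return every combination of the five coverage conditions."""
    return list(product(OPERATING_MODES, FAULT_LOCATIONS, FAULT_TYPES,
                        SPARES_STATUS, DEGENERATION_REQUIRED))


# One coverage value is to be computed per condition set; None marks
# an entry that has not yet been calculated.
coverage_table = {cond: None for cond in enumerate_condition_sets()}
```

Even this small example yields 80 condition sets, which indicates why
the coverage calculation is mechanized rather than performed by hand.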
2.1.5 System Failure

It is assumed throughout this study that a coverage failure
inevitably results in total system failure. Specifically, this
includes:

• failure to detect a fault

• failure to correctly isolate a detected fault

• failure to recover from a correctly isolated
fault.

(In addition, of course, lack of sufficient hardware to constitute 
an operational, degenerate system also leads to the same consequence.) 

While this is in general a reasonable assumption, it is 
possible to conceive of reconfiguration strategies that recover from 
certain transients, which would otherwise be disabling, by changing
to a degenerate mode of operation (e.g., by entering a simplex mode). 
Similarly, in certain situations, other forms of coverage failure 
could conceivably also lead to a degenerate mode rather than to system 
failure. 

2.1.6 Spares Reassignment 

Given a degenerative failure, a quantity of fully operational 
on-line LRU's are typically released from active service. Depending
on system architecture specifics, those so released may or may not
be available for use as spares in the event of subsequent failure.
Since either situation may prevail, it is necessary that both be
accounted for in the model, with the determination of which is most
appropriate for representing a given configuration left to the
discretion of the user.

2.1.7 Spares Failure

Since spare LRU's are themselves subject to failure (at
rate μ), it is necessary to provide for situations in which the
spare selected to replace a failed on-line LRU has itself failed,
and must consequently be replaced by yet a second (or, in the
general case, nth) spare. In the model, this is provided for in
the form of a single delta coverage value per fault classification
per operating mode, wherein the as-calculated value is a measure
of the reduction in recovery probability resulting from incurred
downtime during the conditional trial and error replacement period.
In calculating this, it is assumed that failed spares are purged
from the system in bulk on the occurrence of mode degeneration,
and as individuals following their selection as a replacement unit.
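
The trial and error replacement process can be illustrated with a
short sketch. It assumes, for illustration only, that every spare has
been exposed to the constant standby failure rate for the same length
of time and that spares fail independently; it does not reproduce the
delta coverage calculation itself. The function names are
hypothetical.

```python
import math


def spare_failed_probability(standby_rate, hours):
    """Probability that a spare has already failed when selected."""
    return 1.0 - math.exp(-standby_rate * hours)


def first_good_spare_distribution(standby_rate, hours, n_spares):
    """Probability that the k-th spare tried is the first good one.

    Returns a list indexed from the first trial (k = 1). The entries
    sum to the probability that at least one good spare exists.
    """
    q = spare_failed_probability(standby_rate, hours)
    return [q ** (k - 1) * (1.0 - q) for k in range(1, n_spares + 1)]
```

Each additional trial adds downtime before recovery can begin, which
is the effect the delta coverage value is intended to capture.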

2.1.8 Fault Classification 

The ability of a given detector to recognize a specific fault 
is, in most instances, highly dependent on the precise nature of the 
fault. For purposes of this study, it is assumed that two levels 
of classification are adequate for the purpose of so categorizing 
faults. The first of these, referred to as fault class, is the LRU 
type (e.g., CPU or memory) in which the fault occurs. The second,
or fault subclass, is an arbitrarily selected portion of an LRU (e.g.,
the arithmetic unit in a CPU).

Thus, by proper partitioning of the object system representation,
first into LRU's or classes, and then into subclasses, the various
faults can be categorized according to their specific detection
characteristics.

2.1.9 Fault Detection

The representation of an object computer system as a series
of fault subclasses underlies the approach to coverage measurement
utilized in this study. Specifically, each subclass is by design
a grouping of possible faults, so categorized solely on the basis
of their common susceptibility to one or more fault detectors. In
other words, the set of fault detectors associated with a particular
subclass is, by definition of the subclass itself, comprised wholly
of statistically independent members. In turn, such independence 
implies that each member detector has a finite probability of detecting 
any of the possible faults within the subclass, and that it therefore 
competes with all other members toward the objective of being first 
to detect a local fault. 

This concept of competing fault detectors should not be 
construed to imply that a degree of equality need exist between 
them; on the contrary, the contest is of necessity as imbalanced 
as reality dictates. 

For a moment perhaps, it is useful to look at the performance
characteristics of a stand-alone detector, i.e., one which has no
competitors within the subclass. Given both singular occurrences of
random faults and infinite time, the detector has an associated
non-zero probability of detection. This value, as well as the
corresponding detection rate profile, forms a substantial portion of
the detector's characterization.

In addition, however, it is necessary to consider the
occurrence rate of the detector itself, i.e., the timing basis or
schedule on which it executes. Two basic categories are apparent:
periodic and aperiodic. The former is populated, in the main, by
software based detectors (e.g., software selftest), whereas the
latter group consists primarily of hardware detectors (e.g., memory
parity). (In both of these cases, it is assumed that the initial
execution of the test, if unsuccessful in detecting a specific
previous fault, precludes any possibility of later detection, i.e., the
fault is transparent to the test.)

Returning then to a competitive situation, and given each 
detector characterized in terms of its stand-alone detection probabil- 
ity, detection rate and schedule, it would be a relatively simple 
task to compute an overall subclass detection probability and expected 
time, based on the interaction (i.e., competition) between detectors. 

It would not be sufficient, however, for the following reasons: 

• knowledge of the specific test by which a fault is detected 
is frequently utilized, in the object computer, to constrain 
the isolation process 

• individual detectors vary in their ability to either pre-
vent or minimize fault propagation (i.e., contamination of
correct information via improper processing). Since this
can have a major influence on the probability of recovery,
knowledge of the specific detector involved must be made
available.

It is necessary, therefore, that a conditional detection prob- 
ability and rate be computed for each subclass member, as the values 
relate to the totality of competitive detectors. 
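
The competition between detectors can be sketched numerically. The
example below assumes, purely for illustration, that each detector is
described by a stand-alone detection probability p and an exponential
detection-time profile with rate r, and that detectors act
independently. It then integrates, for each detector, the chance that
it is the first to detect the fault. The function name and the
exponential profile are assumptions of this sketch, not the model's
own formulation.

```python
import math

def first_detection_shares(detectors, t_max=50.0, steps=20_000):
    """detectors: list of (p, rate) pairs, one per competing detector.
    Returns, per detector, the probability that it detects the fault
    before any competitor does (midpoint-rule integration to t_max)."""
    dt = t_max / steps
    shares = [0.0] * len(detectors)
    for k in range(steps):
        t = (k + 0.5) * dt
        # Detection density and "not yet detected" probability at time t.
        dens = [p * r * math.exp(-r * t) for p, r in detectors]
        surv = [1.0 - p * (1.0 - math.exp(-r * t)) for p, r in detectors]
        for i in range(len(detectors)):
            others = 1.0
            for j, s in enumerate(surv):
                if j != i:
                    others *= s
            shares[i] += dens[i] * others * dt
    return shares
```

Under these assumptions the shares sum to the overall subclass
detection probability, while each individual share is the conditional
quantity needed to tie a detection to the specific detector involved.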

2.1.10 Fault Isolation

Unlike fault detection, the isolation of faults is assumed to
be a single-threaded process in which a lone isolator seeks out the
source of difficulty. Since the object system, in all probability
however, incorporates multiple isolators (each tailored to a specific
task), it is necessary to provide accordingly in the model.

In particular, it is assumed that each detection process is 
uniquely linked to an isolator such that, given detection of a fault 
by it, the corresponding isolator is unconditionally brought into 
play. In turn the isolator, by virtue of its tailored design, in- 
herently capitalizes on whatever unique characteristics the detector 
may possess in order to constrain the bounds of the search. 

By way of example, the isolation process associated with 
detection by memory parity is clearly minimal (if in fact it exists 
at all), whereas detection by an external voter implies nothing in
the way of fault location, thus mandating a far-ranging and potentially 
time-consuming search. 

As in the case of fault detectors, fault isolators are seldom,
if ever, perfect, and thus have both a non-unity success probability
and a finite process rate. For analogous reasons then, this must be
accounted for in the impending fault recovery process.

2.1.11 Fault Recovery 

The hardware aspects of fault recovery are reasonably well
disciplined. Either a spare hardware unit and the needed switching
apparatus are available, in which case that portion of the cure can
be effected, or they are not, in which case the system either degen-
erates or fails completely (barring the use of techniques such as
those of "graceful degradation" in the event of partial memory loss,
etc.).

The coverage portion of fault recovery, on the other hand,
must contend with all issues save those of hardware availability,
and is thus confronted with a rather unwieldy grouping of consider-
ations which, in turn, are considerably more difficult to deal with,
both in the object system itself and, consequently, in the model as
well.

In particular, it is felt that each of the following is of
sufficient importance as to necessitate inclusion in the model:

• the subclass in which the fault is located (for reason
of its potential influence on recovery difficulty)

• the time delay between fault occurrence and fault recovery,
as it relates to the issue of system downtime. (Of nec-
essity, this incorporates any time devoted to spares
checkout including repetitions, where required, to locate
a functional spare.)

• the degree and location(s) to which fault propagation
was contained, and the corresponding time interval during
which it was unconstrained. (The model assumes here, as
a first order approximation, that this interval starts
on occurrence of the fault and ends on its detection.
Compensatory adjustments, where called for, can be achieved
by modifying the degree of sensitivity.)

• the detector responsible for finding the fault, as it
relates both to the degree of fault propagation and the
time required for detection

• the system operating mode at the time of fault occurrence

• the type of fault involved (permanent or transient)

• the need, or lack of it, for mode degeneration.

In addition, successful recovery is assumed to be conditioned
by the following:

• successful detection of the fault

• successful isolation of the fault

• in the case of a permanent fault, availability of a
spare LRU or, barring that, sufficient on-line LRU's
to maintain operation in a degenerate mode

• in the case of a permanent fault, successful detection
of any failure present in a selected spare LRU.

2.1.12 D/I/R Mechanisms

Since each of the object system fault detection (D),
isolation (I) and recovery (R) processes has the potential for
reacting differently to a fault as a function of its location
(subclass), the type of fault (permanent or transient), the system
operating mode (fully operational or degenerate) and any requirement
for degeneration (given certain hardware losses in a fully operational
system), it is necessary to provide a means for modeling each possible
reaction separately. The total set of system reaction descriptions
corresponding to detection by a particular detector is then referred
to as a D/I/R mechanism.

By way of illustration, take the case of a CPU selftest
fault detector. As its name implies, it is most proficient at
detecting faults which occur in the CPU; nonetheless, it exhibits a
finite capacity to detect those in other subclasses as well. For
instance, given that the test executes out of memory, it follows
that certain faults located therein will cause it to
misexecute, thus resulting in an eventual fault indication (albeit
for the wrong reason). Similarly, though the test is reasonably
adept at detecting permanent CPU faults, its capacity in the 
presence of transients is, at best, marginal. 

It follows then, in the general case, that a detector's
performance may be expected to vary as a function of all the 
aforementioned environmental conditions. In analogous fashion, 
the isolation and recovery processes associated with that detector 
may likewise be expected to exhibit similar dependencies. 

In context, then, a D/I/R mechanism is the total set of
performance descriptors pertaining to a particular fault detector.
As such, it consists of separate detection, isolation and recovery
performance characterizations, one each for every meaningful
combination of conditions.
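
One way to picture a D/I/R mechanism is as a lookup table keyed by the
prevailing conditions. The sketch below is illustrative only: the type
names, the condition labels and the placeholder descriptor strings
("D-a", "I-a" and so on) are hypothetical, and the degeneration
requirement is omitted for brevity.

```python
from typing import Dict, NamedTuple, Optional

class Conditions(NamedTuple):
    subclass: str    # e.g. "CPU" or "memory"
    fault_type: str  # "permanent" or "transient"
    mode: str        # "full" or "degenerate"

class Characterization(NamedTuple):
    detection: str   # placeholder labels, not real model data
    isolation: str
    recovery: str

Mechanism = Dict[Conditions, Characterization]

# Hypothetical entries for a CPU selftest detector.
cpu_selftest: Mechanism = {
    Conditions("CPU", "permanent", "full"): Characterization("D-a", "I-a", "R-a"),
    Conditions("memory", "permanent", "full"): Characterization("D-b", "I-b", "R-a"),
}

def lookup(mechanism: Mechanism, conditions: Conditions) -> Optional[Characterization]:
    """Return the characterization for the given conditions, or None
    when the detector has no meaningful response under them."""
    return mechanism.get(conditions)
```

Combinations for which the detector offers no meaningful response are
simply left out of the table in this sketch.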

2.1.13 Performance Characterization 

As noted, each of the object system fault detectors, together
with its associated fault isolation and fault recovery techniques, is
represented in the model as a D/I/R mechanism. Each mechanism is
then represented by a set of performance characterizations, one each 
for all combinations of detection, isolation and recovery processes 
across a spectrum of conditions. 

In turn, each of the individual performance characterizations
is expressed in terms of a function specification and a referenced
function model. The latter are parametrically expressed probability
and probability density functions which serve as a representative
form or pattern for the characterization, and the former
provide specific numeric details to round out the characterization.
The end object of these (i.e., a fully described process represen-
tation) is then referred to as a function.

In particular, the following set of function models is 
viewed as sufficient to enable adequate characterization of the 
majority of known processes:

• Impulse - a burst of zero duration with finite 
integral, representing an instantaneous process 

• Constant - a pulse with finite duration, representing 
a constant process 

• Finite Pulse String - a series of equally spaced 
pulses, representing a discontinuous process 

• Exponential - a decaying exponential, representing 
a continuous, but ever decreasing, process. 

In practice (i.e., in the software implementation of the model), 
each function specification is identified, for reason of user 
convenience, by means of a unique function number. As a consequence,
like process characterizations, once defined, can be elicited 
repeatedly via their identifying number. 
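
The relationship between function models, function specifications and
function numbers can be sketched as follows. This is an illustrative
reading only: the parameter names (coefficient, delay, duration,
spacing, count, rate), the treatment of the pulse string as equally
weighted zero-width pulses, and the example function numbers are all
assumptions of the sketch. Each class plays the part of a function
model, each instance that of a function specification, and the
dictionary key that of the function number.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Impulse:
    coefficient: float       # total probability delivered at the impulse
    delay: float = 0.0
    def cumulative(self, t: float) -> float:
        return self.coefficient if t >= self.delay else 0.0

@dataclass(frozen=True)
class Constant:
    coefficient: float       # total probability over the whole pulse
    delay: float
    duration: float
    def cumulative(self, t: float) -> float:
        frac = min(max((t - self.delay) / self.duration, 0.0), 1.0)
        return self.coefficient * frac

@dataclass(frozen=True)
class PulseString:
    coefficient: float       # total probability over all pulses
    delay: float             # time of the first pulse
    spacing: float
    count: int
    def cumulative(self, t: float) -> float:
        if t < self.delay:
            return 0.0
        fired = min(int((t - self.delay) // self.spacing) + 1, self.count)
        return self.coefficient * fired / self.count

@dataclass(frozen=True)
class Exponential:
    coefficient: float
    rate: float
    delay: float = 0.0
    def cumulative(self, t: float) -> float:
        if t < self.delay:
            return 0.0
        return self.coefficient * (1.0 - math.exp(-self.rate * (t - self.delay)))

# Hypothetical function numbers mapped to function specifications.
FUNCTIONS = {
    1: Impulse(0.9),
    2: Constant(0.8, delay=0.0, duration=4.0),
    3: PulseString(0.6, delay=0.0, spacing=1.0, count=3),
    4: Exponential(1.0, rate=1.0),
}
```

A characterization defined once in this way can then be reused
wherever its number is cited, e.g. `FUNCTIONS[2].cumulative(2.0)`.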

Viewed as a whole, D/I/R mechanisms are utilized, on a one
for one basis, to describe the totality of performance character-
istics associated with each fault detector. In turn, mechanisms
are defined in terms of function numbers, one each for all
combinations of conditions and processes. Each function number
corresponds directly to a function specification which, in turn,
contains both quantitative data and a reference to a particular
function model. The latter then is a parametric expression which
describes the general form of the performance characterization.

2.1.14 DIET vs. D/I/R Process Representation

In previous sections, frequent reference has been made to
the detection, isolation and recovery process representations which
constitute a D/I/R mechanism. In particular, the requirements for
modeling each were developed in sections 2.1.9, 2.1.10 and 2.1.11
respectively, their relation to a D/I/R mechanism presented in
section 2.1.12, and the general method utilized to specify them
described in section 2.1.13.

As a final note, however, it should be recognized first,
that given both a particular detector and set of environmental
conditions, there exists a singular set of process representations
which describe the corresponding detection/isolation/recovery
sequence and second, that the set consists of four representations,
one each describing the detection (D), isolation (I), error-
propagation-recovery (E) and time-lost-recovery (T) facets of the
corresponding mechanism. The latter subdivision (of recovery into
components T and E) is a direct consequence of the rationale
presented on page 2-10 (the list in section 2.1.11), second and
third items respectively.

2.2 SAMPLE FUNCTION SELECTIONS 

Although the objective of this study is to develop a
relatively general-purpose system reliability prediction tool, an
underlying goal is its application to a specific fault tolerant
computer configuration: a dual channel system consisting of input,
CPU, memory and output LRU's, plus an external comparator (cf. Figure
2-1).

To this end, D/I/R mechanism definitions and performance
characterizations have been made for a set of fault detectors, each
member of which has potential application in the subject system.
Table 2-1 summarizes these, wherein the leftmost column is
a list of the detectors (and equivalently, the D/I/R mechanisms).
The double row to the right of each detector defines the mechanism
as it relates to varying environmental conditions, the latter con-
sisting of combinations of fault subclass (Input vs. CPU etc., across
the top heading row), fault type (Permanent vs. Transient, across
the second heading row), and system operating mode (2 channel vs.
1 channel, down the second column from the left). In turn,
each rectangle (representing a single mechanism's performance under
specific conditions) contains either a zero (representing a null
response) or a series of four numbers. In the latter case, the
four correspond to detection (D), isolation (I), error-propagation-
recovery (E) and time-lost-recovery (T) function numbers,
respectively, as indicated by the third heading row.

[Figure: Reliability Diagram]

The function numbers referred to in Table 2-1 are defined,
in Table 5-4, in terms not yet presented. The rationale leading
to their specification and selection, however, is discussed at length
in the remainder of this section. Prior to initiating this, though,
the following points should be noted:

• individual detectors have been purposely interpreted,
in terms of their scope, in a somewhat restrictive
fashion. For example, software selftest can be
viewed as including some form of memory addressing
scheme check; in terms of this study, however,
each has been interpreted to be a separate and dis-
tinct test, and categorized accordingly.

• certain detectors, in the process of searching out faults in
a particular unit, have an inherent capacity to detect faults
in others as well. This capacity, though perhaps beyond
the intended scope of the test, is, in many cases, significant
enough to warrant inclusion in the model set.

• in many instances, the ability of a detector to find faults
in a particular unit, though possibly non-zero, is of insuf-
ficient magnitude to warrant consideration in the study.

• the time at which a fault actually occurs is viewed as im-
material and, in its place, emphasis is placed on the time
at which it can first affect other processes.

2.2.1 Detection Function Rationale

The following subsections describe the reasoning behind the 
detection function selections shown in Table 2-1: 

2.2.1.1 Memory Code

Any such code, for example, Hamming or simple parity, offers 
the same form albeit not the same degree of protection. This can be 
characterized by noting that: 

• in a typical implementation, only memory bit plane (i.e., 
data) faults are protected 

• the degree of protection offered is independent of the 
quantity of channels operating and independent of whether 
the fault is permanent or transient 

• to the extent that a given faulty bit pattern is covered,
the fault will be detected on reading the cell and, thus,
given a reasonable computer and software design, mission
computation will cease prior to any consequent incorrect
processing.

The latter gives rise to a belief that any such code can be modeled 
via an impulse. Further, since error propagation is precluded, the 
effective delay time between fault occurrence and fault detection is 
zero. 

2.2.1.2 Output Compare/Vote

The technique requires the presence of at least two operating 
and fully synchronized channels and is, therefore, inoperative in all 
one-channel situations. Given two or more, it is capable of detecting 
all faults save those in the output unit, on their passage from chan- 
nel to comparator (cf. Figure 2-1). 

On the assumption that the comparator/voter receives all
digital output commands, but no others, its detection rate can be
characterized as a decaying exponential trailing off eventually to
zero. (The presumption here is that all meaningful faults ultimately 
result in the output of erroneous values, and that the probability as 
to when is highest immediately following fault occurrence and ever 
diminishing thereafter.) 

Since detection is presumed to occur on output, external, 
but not internal, faults are precluded from propagating. 

2.2.1.3 I/O Wraparound

Test performance is independent of the quantity of channels 
operating. In effect, only permanent faults are detected in that 
transients can be observed only in the situation where they occur 
during performance of the test case. Such faults affect only test 
data.

The test, by design, checks out portions of the input and
output units; indirectly, however, it affords some detection of faults
in the remaining units by virtue of its dependence on them for opera-
tion of its software-based implementation.

The detection rate can be characterized as a pulse (i.e., 
a constant detection rate over its operating interval) in that the 
test consists primarily of the sequential checkout of multiple chan- 
nels, each with equal likelihood of having failed. Because it is a 
periodic (i.e., scheduled) test, it affords little protection against 
fault propagation. Similarly, the delay time between fault occurrence 
and fault detection is dependent on when the fault occurs with respect 
to the scheduled test time.
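
The dependence of detection delay on fault timing can be illustrated
with a small sketch. It assumes, as a simplification, that the test
runs instantaneously at fixed intervals; the run time of the test
itself is ignored, and the function and parameter names are
illustrative only.

```python
import math

def detection_delay(fault_time: float, period: float, first_run: float = 0.0) -> float:
    """Delay from fault occurrence to the next scheduled test run."""
    if fault_time <= first_run:
        return first_run - fault_time
    runs_elapsed = math.ceil((fault_time - first_run) / period)
    return first_run + runs_elapsed * period - fault_time

def mean_delay(period: float, samples: int = 1000) -> float:
    """Average delay for faults spread evenly over one test period."""
    times = [(k + 0.5) * period / samples for k in range(samples)]
    return sum(detection_delay(t, period) for t in times) / samples
```

For faults spread evenly over the interval, the average delay under
these assumptions is half the test period, and a fault occurring just
after a run goes undetected for nearly a full period.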

2.2.1.4 CPU Software Selftest

The expression software selftest often implies a medley of
assorted tests. For purposes of clarity, all but one of these,
CPU test, have been broken out and categorized separately.

The residual test is seen as offering no protection against 
meaningful transient faults and no protection against input or output
faults. By design, it detects the majority of CPU faults, with the 
rate function seen as a truncated decaying exponential. In this 
sense, the cutoff portion represents those faults which are not de-
tected, and the non-zero duration of the truncated exponential cor-
responds to the run time of the test. The decaying exponential form
represents a belief that the majority of faults (for reason both of 
test design and hardware requirements to run the test) are detected 
near the beginning of the test and that, from there on, it is a fight 
against ever-diminishing returns up through test end. 
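
One way to write the truncated exponential down is given below. The
symbols a (decay rate), T (test run time) and P_D (detected fraction)
are introduced here for illustration only, and the rate function is
assumed to be scaled so that the untruncated exponential would
integrate to one.

```latex
d(t) = a\,e^{-a t}, \quad 0 \le t \le T,
\qquad
P_D = \int_0^T a\,e^{-a t}\,dt = 1 - e^{-aT},
\qquad
1 - P_D = e^{-aT}.
```

Under this assumption the cut-off tail, e^{-aT}, is the fraction of
faults which the test never detects.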

The test, by nature of its being a scheduled event, precludes
little in the way of fault propagation and, for the same reason,
varies in its detect time as a function of the relationship between
fault occurrence time and test run time.

Since the test operates out of memory, it offers a degree 
of fault detection with respect to memory-based faults.

2.2.1.5 Reasonableness Tests

The tests are interpreted to be a series of short duration 
checks, primarily on the validity of recently calculated data, which 
are performed throughout the epoch cycle. As such, their performance 
is unaffected by the quantity of channels operating, as well as by
whether the fault is permanent or transient. By reason of inacces-
sibility to internal processes, output faults cannot be detected.

The use of a zero delay impulse model to characterize the
detection rate function for input faults is based on the inherent
capacity of the tests to preclude certain faults from entering the
system. This in turn is seen as equivalent to a zero detection time.

On the other hand, CPU and memory are provided with unsched- 
uled pulse rate functions, as warranted by their aperiodic character- 
istics (with respect to fault occurrence) and the consequent variant 
time interval prior to detection. 

2.2.1.6 Timeout Counters

Such counters can be incorporated in any or all channel sub- 
units, though incorporation in memory is viewed as offering word 
select protection only. In all cases, other than CPU, the tests can 
be adequately characterized by means of an unscheduled zero-delay 
impulse, due to their essentially instantaneous service. The CPU
counter, on the other hand, is presumably a watchdog timer and, thus,
has a delay time measured in milliseconds. Since the time of detec-
tion is independent of when the fault occurs (within the timeout
interval), the test appears as an unscheduled pulse. Consequently, 
little protection against fault propagation is offered. 

All forms of the test are relatively insensitive to whether 
the fault is permanent or transient, and totally insensitive to the 
quantity of operating channels.
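
The behavior of a CPU watchdog timer can be sketched as a simple
simulation. The class below is a hypothetical illustration rather than
a description of any particular hardware: healthy software restarts
the countdown at intervals, and a fault that stops those restarts is
flagged once the timeout elapses.

```python
class WatchdogTimer:
    """Minimal simulated watchdog; all names are illustrative."""

    def __init__(self, timeout_ms: int):
        self.timeout_ms = timeout_ms
        self.elapsed_ms = 0
        self.expired = False

    def kick(self) -> None:
        """Called by healthy software to restart the countdown."""
        if not self.expired:
            self.elapsed_ms = 0

    def tick(self, ms: int = 1) -> bool:
        """Advance time; returns True once the timeout has been reached."""
        self.elapsed_ms += ms
        if self.elapsed_ms >= self.timeout_ms:
            self.expired = True
        return self.expired
```

In this sketch the fault is flagged one full timeout after the last
restart of the countdown, regardless of when within the interval the
fault itself occurred.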

2.2.1.7 CPU Codes

Though other forms are possible, this class of test is typ-
ically implemented as a residue or product code applied to the ALU,
data path and register portions of a CPU. As such, it offers singular
protection against a large subclass of CPU faults, and does so inde-
pendent of whether the fault is permanent or transient. The quantity
of computers operating has no impact.

Since faults are detected on occurrence, all forms of fault 
propagation are precluded and, thus, the proper model is an unsched-
uled impulse of zero delay.
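
A residue code of the kind mentioned above can be illustrated with a
modulo-3 check on addition. The sketch is an assumption-laden example
(non-negative integers, no overflow, a fault modeled as bits XORed
into the sum) and is not drawn from any specific CPU design.

```python
from typing import Tuple

MODULUS = 3  # modulo-3 residue, chosen for illustration

def residue(x: int) -> int:
    return x % MODULUS

def checked_add(a: int, b: int, fault_mask: int = 0) -> Tuple[int, bool]:
    """Add a and b; fault_mask flips bits in the sum to mimic an ALU fault.
    Returns (result, fault_detected)."""
    result = (a + b) ^ fault_mask
    # The residue of a correct sum equals the sum of the residues.
    expected = (residue(a) + residue(b)) % MODULUS
    return result, residue(result) != expected
```

Because a single flipped bit changes the sum by a power of two, which
is never a multiple of three, every single-bit error is caught on the
spot. Some multi-bit errors escape, e.g. `checked_add(25, 17, 7)`
yields 45 in place of 42 without detection, which is consistent with
the code protecting a large, but not complete, subclass of faults.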

(Note that the existence of fault partitions within the CPU 
(from the view of this particular test) suggests rather strongly that 
the CPU might better have been treated as two subclasses for coverage 
computation purposes (i.e., one containing the ALU and the other, the 
control unit.)) 

2.2.1.8 Memory Sumcheck

The technique, which consists of occasional exclusive-or sum- 
checks on the contents of invariant subportions of memory, offers a 
significant degree of memory bit plane and word select fault detection. 
In the case of bit plane faults, the protection afforded is unaffected
by whether the perturbation was transient or permanent, whereas in 
the word select case, only permanent faults (in any meaningful sense) 
are detected. In both cases, the quantity of computers operating has 
no impact. 

Since the test, by virtue of its rather excessive run time, 
is typically partitioned into a series of N subtests, each operating
on a portion of memory, it can be characterized by a string of N im- 
pulses, typically extending beyond a single epoch cycle. Due to the 
scheduled nature of the test, detection time is aperiodic with respect 
to fault occurrence, and little protection is offered against either 
form of fault propagation. 
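
The partitioned sumcheck can be sketched as follows. The helper names,
the equal-sized partitioning and the use of a Python list to stand in
for invariant memory are assumptions made for the example.

```python
from functools import reduce
from operator import xor

def xor_sum(words):
    """Exclusive-or of all words in the sequence."""
    return reduce(xor, words, 0)

def partition(memory, n):
    """Split memory into n contiguous blocks (the last may be shorter)."""
    size = -(-len(memory) // n)  # ceiling division
    return [memory[i:i + size] for i in range(0, len(memory), size)]

def reference_sums(memory, n):
    """Sumchecks recorded while the memory contents are known good."""
    return [xor_sum(block) for block in partition(memory, n)]

def run_subtest(memory, n, k, reference):
    """Run the k-th of n subtests; True means a mismatch was found."""
    return xor_sum(partition(memory, n)[k]) != reference[k]
```

A corrupted word is reported only when the subtest covering its block
comes around, which is why detection is spread over a string of N
discrete events. An even number of flips of the same bit position
within one block would cancel and pass unnoticed.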

As a consequence of the test being implemented in software, 
a small degree of protection is offered against CPU faults.

2.2.1.9 Invalid Instruction

This test, in many respects, serves more as an indicator
of software errors than hardware faults. For completeness it is, 
however, included. 

To the degree that faults are in fact detected, the protec-
tion is independent of both the quantity of operating computers, and
whether the fault is permanent or transient. No protection against 
input and output faults is offered. The detection of CPU and memory 
faults, though severely limited in scope, is instantaneous and inde- 
pendent of the time of fault occurrence. 

2.2.1.10 Read/Write Protect 

The test, assumed to be implemented in hardware, serves
primarily as a software error detector. It does, however, have limited
hardware fault detection capabilities, and is, therefore, included.

No protection is offered in the case of input and output
faults. CPU and memory faults are in select cases detectable and, 
to this extent, the protection is independent of the quantity of com- 
puters operating and also independent of whether the fault is perma- 
nent or transient. 

Detection, when it occurs, is instantaneous and, thus, the 
appropriate model is the zero delay impulse.

The following, severely limited, classes of faults are de-
tectable by this technique:

• CPU - limited instances of incorrect memory address
computation

• Memory bit plane - situations where the address portion
of an instruction stored in memory is, on access,
incorrect to the extent of crossing a protection
boundary

• Memory word select - instances where an instruction
fetch is performed on an incorrect memory cell,
which itself contains an out-of-range address.

In these situations, fault propagation is precluded. 

2.2.1.11 Address Feedback 

The technique, though seldom discussed, offers a significant
degree of singular protection against memory word select faults. Im-
plementation techniques vary, with one of the more straightforward
methods consisting of the encoding of address and data, rather than
merely data, on storage of a word into memory. Given then a reason- 
ably strong code, for example, six-bit Hamming, a high percentage of 
addressing faults can be detected on readout via the decoding process. 
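
The idea of encoding address together with data can be illustrated
with a deliberately weak stand-in for the Hamming code mentioned
above: a single parity bit computed over both. The class, its method
names and the fault-mask parameter (bits XORed into the address to
mimic a word select fault) are all hypothetical.

```python
def parity(x: int) -> int:
    """Parity (0 or 1) of the set bits in x."""
    return bin(x).count("1") & 1

class CheckedMemory:
    """Each cell stores (data, check), with check covering address and data."""

    def __init__(self, size: int):
        self.cells = [(0, parity(addr)) for addr in range(size)]

    def write(self, addr: int, data: int) -> None:
        self.cells[addr] = (data, parity(addr) ^ parity(data))

    def read(self, addr: int, addr_fault_mask: int = 0):
        """Return (data, ok). The fault mask redirects the access to the
        wrong physical cell, as a word select fault would."""
        physical = addr ^ addr_fault_mask
        data, check = self.cells[physical]
        ok = check == (parity(addr) ^ parity(data))
        return data, ok
```

With one parity bit, any access that lands on a cell whose address
differs in a single bit is flagged on readout, whereas a two-bit
address error passes unnoticed. A stronger code, as the text suggests,
would narrow that gap.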

Protection is instantaneous, and independent of fault occur-
rence time, quantity of operating computers, and whether the fault is
permanent or transient. Fault propagation is precluded.

2.2.1.12 Addressing Scheme Check 

Given either a 2½D or 3D memory addressing scheme (prefer-
ably the latter), this technique offers a reasonable form of memory
word select verification. Its methodology, which is based on the use
of coincidence for word selection, enables the bulk of the memory
addressing circuitry to be verified (the exception being that fraction
which is unique to the variable portion of storage).

The technique, which relies on the exclusive-or sumcheck
testing of a specific small subset of memory locations, detects
addressing faults by virtue of the implication that correct sumcheck
signifies correct addressing. The test is implemented in software
and, thus, also offers a small degree of CPU and memory bit plane
checkout. The latter is supplemented, to a minor extent, via the sum-
check process itself which verifies the contents of the memory subset.

Because of its software basis, the test is necessarily peri- 
odic and, therefore, incapable of precluding fault propagation. Simi- 
larly, the time required to detect is dependent on the relationship 
between fault occurrence time and test run time. The test is not 
dependent on the quantity of computers running. Little or no prac- 
tical protection is offered against transient faults. 

2.2.2 Isolation Function Rationale 

Three functions have been selected to represent the isolation 
process for this particular implementation of the four LRU, dual- 
channel system (cf. Table 5-4).

2.2.2.1 Primary Isolation Processes

This isolator, represented as a slightly delayed impulse 
with unity coefficient, typifies the situation in which the fault 
location implied by the initiating detector was, in fact, the actual
location. For example, in the case of CPU selftest, the implied
location is the CPU itself and, as a consequence, the first area 
searched in the isolation process. 

Given that the implication is correct, the corresponding 
determination of this fact is made after a short delay (as measured 
from the end of the detection process) which itself represents the 
duration of the isolation function. 

2.2.2.2 Secondary Isolation Processes

In the situation where the fault location lies other than 
in the detector implied LRU, the search is both misguided (initially) 
and, of necessity, quite extensive in domain. Correspondingly, it 
is best modeled as a pulse of long duration which, in turn, repre- 
sents a process with constant isolation probability throughout the 
total interval.

2.2.2.3 Tertiary Isolation Processes

The isolation process is most complex, and least likely
to succeed, in situations where either a transient fault has occurred,
or where the fault is totally unlocalized (as with detection by a
comparator external to the channel). Consequently, an impulse model,
with long initial delay, has been selected as reasonably representative.

2.2.3 Fault Propagation Recovery Function Rationale 

Two functions have been selected to represent the fault propa-
gation aspect of the recovery process. The first of these is a pulse
of long duration, and represents a situation in which the recovery
probability is undiminished, from this effect, up through the end
point of the pulse. It corresponds to either of two situations:

• the associated fault detector precludes any appreciable 
form of fault propagation 

• fault propagation is of little consequence, due to the
presence of an operational second channel which can be 
utilized for update purposes 

The second function corresponds to the opposite situation, 
in which fault propagation is of considerable potential effect, and 
the function is, thus, provided in the form of a decaying exponential, 
whose diminishment with time represents a like decrease in recovery 
probability. It is utilized primarily in the combined situation of 
degenerate system operation and the presence of fault propagation. 

2.2.4 Time Lost Recovery Function Rationale 

The time lost aspect of fault recovery contends with the effect,
on external-to-the-system operational requirements, of channel loss
during a "fault-present" interval. In the situation where both chan-
nels were operative prior to the fault, one channel normally continues
servicing external needs, and, thus, the overall effect is minimal.
A long duration pulse is applied in this case.

In the situation either of previous degenerate operation or
fault detection by the comparator, however, the total system is
temporarily out of service, and, thus, no external processing
requirements are met. (Detection by the comparator is assumed to
require that both channels cease normal operation and enter into the
isolation phase concurrently.) The function selected to represent
this situation is a decaying exponential with cutoff in the region
of mission critical time.
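
A minimal sketch of the time lost function for the out-of-service case follows. The names `rate` and `cutoff` are illustrative assumptions; the text only states that the exponential is cut off in the region of mission critical time.

```python
import math


def exponential_with_cutoff(t, rate, cutoff):
    """Decaying exponential that drops to zero once the elapsed down-time
    exceeds `cutoff` (assumed here to be the mission critical time)."""
    if t < 0.0 or t > cutoff:
        return 0.0
    return math.exp(-rate * t)
```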

2.3 TERMINOLOGY DEFINITIONS 

Batch Run - The total data processing task performed at one time.
It may consist of one or more run-sets.

Category 1 Switch Failure - A failure, typically of spares-switching
hardware, such as to prevent proper operation of a single unit. This
possibility must be accounted for in estimating the failure rate of
the unit.

Category 2 Switch Failure - A failure, typically of spares-switching
hardware, such as to force degeneration of the system from mode M to
mode M+1.

Category 3 Switch Failure - A failure, typically of comparator or 
voter hardware, such as to disable the entire system. 

Channel - The minimum complement of operational hardware required to 
perform the system computational task. 

Coverage - The conditional probability, given both an on-line unit 
failure and sufficient spare hardware to maintain computational pro- 
cesses, that the failure is detected and operation is successfully re- 
established, consistent with application imposed time constraints. 

Coverage Model - A set of equations for evaluating the ability of a 
system to recover from permanent or transient hardware faults in a 
spares-switching environment. Also, the implementation of this model 
in CARE2. 

Degenerative Failure - Any condition which necessitates reconfiguration 
of the system, from Mode M to Mode M+1, in order to regain full pro- 
cessing capabilities. 
D/I/R Mechanism - (Detection/Isolation/Recovery Mechanism) - A
coverage model representation of the conditional performance
characteristics of all detection, isolation and recovery processes
associated with a particular fault detector. In turn, the conditions
are comprised of valid combinations of fault subclass, mode, fault
type and potential requirement for mode degeneration.

Dual Mode Reliability Model - A set of equations for evaluating the 
reliability of a system, given that the system has no more than two 
distinct and relevant operating modes. Also, the implementation of 
this model in CARE2. 

Fault - An event which perturbs computational processes within the
computer system and, as a consequence, requires remedial action, in the
form of detection, isolation and recovery, in order to preclude an
operational anomaly.

Fault Class - The total set of possible faults associated with a sin- 
gle stage of a computer system. 

Fault Detector - A hardware or software process whose function is to
recognize the occurrence of certain classes or subclasses of faults
within a system.

Fault Isolator - A hardware or software process whose function is to 
isolate a detected fault to the responsible LRU. 

Fault Recovery Process - A hardware or software process whose function
is to minimize or eliminate the potentially harmful effects of a fault,
given that it is properly isolated, by means of appropriate remedial
actions.

Fault Subclass - A fractional portion of a fault class, distinguished 
from other subclasses within the stage by either its unique suscepti- 
bility, or lack thereof, to recognition and treatment by one or more 
fault detection, isolation and/or recovery processes. 
FTF - (Fault Tolerant Feature) - A hardware or software based process
included in the system for the purpose of detecting, isolating or
recovering from certain classes or subclasses of faults.

Hybrid Channel - A fault tolerant computer system in which the
component stages contain unequal quantities of on-line LRU's, thereby
precluding division of the system into discrete channels.

LRU - (Least Replaceable Unit) - The smallest system component which 
may be switchably replaced by an equivalent spare, in the event of 
its failure. 

Major Cycle - The time interval between repetitions of the total set
of scheduled detectors (i.e., the time after which the total set is
repeated).

Note: In CARE2, this value is calculated internally.

Minor Cycle - The time interval between repetitions of the most
frequent scheduled detector.

Note: In CARE2, all scheduled detector periods must be integer
multiples of the minor cycle.

Mode - One of the possible hardware configurations in which the sys- 
tem can operate. In general, mode M requires more on-line units than 
mode M+1, and is the preferred operating configuration. 

Operational Anomaly - The consequence of a fault which for reason of 
non-detection, incorrect isolation or unsuccessful recovery, is not 
remedied and therefore manifests itself in the issuance of improper 
control signals from the computer system. 

Parameter - A variable, either defaulted, input, or precalculated,
which is meaningful in either the reliability or coverage model
description.

Note: In CARE2, certain parameters, such as the inverse dormancy
factor K, cannot be directly input by the user.

Parameter Vector - A one dimensional Fortran array which contains the
values of a single parameter for all stages in the system.

Permanent Fault - A fault which, because of its permanent nature,
requires either the use of a spare LRU or degeneration in the process
of recovering from its effects.

Run - The process of computing system reliability, versus time, for a 
given list of equation numbers (models) and a fixed set of model 
parameters. 

Run-set - One or more runs, with possible variation of one reliability
parameter type (for all stages), through the use of array PARAM.
(Coverage parameters may not be varied in this way, for reason of
the data base format structure.) Data base management operations
take place between run-sets.

Scheduled Detector - A fault detector, usually software, which is 
initiated at regular periodic intervals. 

Spare LRU - A single LRU contained in the spares pool. Equivalent 
to the term 'standby spare'. 

Spares Checkout - The process of testing a standby spare, for proper 
operation, during the interval between occurrence of a permanent 
fault and commitment of the particular spare as the replacement. 

Spares Pool - The total set of standby LRU's available for substi- 
tution in the event of permanent failure in one of the on-line LRU's. 
Note that substitution is conditioned by a requirement that both the 
failed LRU and its replacement be contained in the same stage. 

Spares Reassignment - The process of releasing the operational LRU's 
in a defunct channel to the spares pool. Given that the specific 
computer system design allows for reassignment, the process takes 
place at the time of degeneration from Mode M to Mode M+1. 
Stage - The total set of identical LRU's, including both on-line and
standby devices, contained in the system (cf. page 4-2).

Standby Spare - A single LRU contained in the spares pool. Equiva- 
lent to the term 'spare LRU'. 

System - The full complement of hardware and software comprising a 
fault tolerant computer configuration. In CARE2, the complex of 
coverage and hardware reliability data associated with a single exe- 
cution run defines the current system. 

TMR - (Triple Modular Redundancy) - A fault tolerant computer system
consisting, initially, of three parallel and active channels in
series with a majority voting element. In the event of a single
channel failure, the voter maintains correct system outputs.

Transient Fault - A fault which, because of its temporary nature, 
requires neither the use of a spare LRU nor degeneration in the 
process of recovering from its effects. 

Transition - The process of switching the computer system operating
mode (from Mode M to Mode M+1).

Unit Time - An arbitrary time interval, as for example seconds or
hours, used in conjunction with failure rates, detection rates,
reliability, etc. Although consistent units must be utilized within a
model (i.e., Reliability or Coverage), there is no need to use
common units in separate models.

Unscheduled Detector - A fault detector, usually hardware, in which 
the detection process is triggered either directly, or effectively 
in outward appearance, by the occurrence of a fault. 




RAYTHEON COMPANY 

EQUIPMENT DIVISION 

SECTION 3 

MATHEMATICAL MODELS 

The total system reliability of a redundant computer configuration
can be expressed mathematically in a number of ways. These expressions
all tend to be rather cumbersome, however, and therefore one of the
objectives of this study was to devise formulations that can be
evaluated efficiently by computer. A second, and equally important goal,
was that the formulations be of a sufficiently general nature as to
provide considerable design flexibility to the user. The following
sections present both the end result of these efforts and, as an
interpretation aid, a listing and definition of all parameters.

3.1 DUAL MODE RELIABILITY MODEL

The Dual Mode Reliability Model subprogram has been extended
significantly beyond the scope of initial plans, such that it now includes
not only dual channel capability, but a variety of others as well. These
include both N channel (where N is a user specified integer) and Hybrid
configuration (where the quantity of active units differs between stages).

The reliability equations utilized in encoding the model, as well 
as a functional description of each, are presented in this section. 
Subscripts and other simple variables used therein are as follows: 

x - Refers to the set of LRU's which constitute a particular
    stage of the reliability model

y - Refers to the operating mode

t - Either the independent time variable or a dummy time
    variable (made clear by context)

τ - A dummy time variable for integration or function referencing

i, j, ℓ - Dummy integer variables for counting or function referencing.
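3.1.1 G: Fortran DCG(IUN,MD,I,T) - "The probability of using
exactly i spares by time t."

$$
G(x,y,i,t) =
\begin{cases}
e^{-\gamma_{x,y} t}\,\dbinom{K_x Q_{x,y} C_{x,y}/\delta_{x,y} + i - 1}{i}\,\left[\delta_{x,y}\left(1 - e^{-\mu_x t}\right)\right]^{i}\, e^{-Q_{x,y}\lambda_x t}, & \text{for } K_x < \infty \\[2ex]
e^{-\gamma_{x,y} t}\,\dfrac{\left(Q_{x,y} C_{x,y} \lambda_x t\right)^{i}}{i!}\, e^{-Q_{x,y}\lambda_x t}, & \text{for } K_x = \infty
\end{cases}
$$
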


This function expresses the probability of using exactly
i of the available spares by time t under the following assumptions:

• $Q_{x,y}$ LRU's are required, with standby and active LRU's
  having the constant failure rates $\mu_x$ and $\lambda_x$,
  respectively.

• The coverage factor, given that the ℓth spare tested,
  following a detected failure, is the first operational
  spare, is $C_{x,y}\,\delta_{x,y}^{\ell-1}$. (Note: coverage is defined here
  as the conditional probability of successful recovery
  given that an error occurs and sufficient hardware is
  available. If recovery is possible, the system operational
  mode is determined solely by the remaining
  available hardware.)

• Non-recoverable transients occur at the rate
  $\gamma_{x,y} = \gamma'_{x,y}\,(1 - P_{r_{x,y}})$ per second. In the case of
  recoverable transients, the system resumes operation in the
  same mode as was in effect prior to the transient.
The first term in the above expression, then, is the
probability that no non-recoverable transients occur by time t.
The product of the remaining terms is the probability that exactly
i LRU's sustain failures by time t, either while operating or
prior to the time they are needed. (Notice that a failure in
a spare LRU is not counted here until that spare would have
been used following a failure in an operating LRU.) To verify
this, we observe that the reliability of an r-on-m configuration
(r spares on m operating elements) having active and standby
failure rates $\lambda$ and $\mu$, respectively, is equivalent to that of
an r-on-$\lambda m/\mu$ configuration having a constant failure rate $\mu$ for
both active and standby elements.* Further, since with probability
$C_{x,y}/\delta_{x,y}$ the system can recover from a failure on the condition
that it can also find and successfully activate a standby spare,
the rate of such recovered failures is $\lambda_x C_{x,y}/\delta_{x,y}$, and hence the
effective number of active elements is $K_x Q_{x,y} C_{x,y}/\delta_{x,y} = M$. The
probability that exactly i spares are used by time t is thus the
product of the probability

$$\binom{M+i-1}{i}\left(1 - e^{-\mu_x t}\right)^{i} e^{-(M-1)\mu_x t}$$

that exactly i recoverable failures occur in the first M+i-1
LRU's, the probability $e^{-\mu_x t}$ that the ith spare is operational,
the probability $\delta_{x,y}^{\,i}$ that i spares are successfully tested, and
the probability

$$\exp\left[-Q_{x,y}\left(1 - C_{x,y}/\delta_{x,y}\right)\lambda_x t\right]$$

that no non-recoverable failures occur during this period.

* cf. J. J. Stiffler, "On the Efficacy of R-on-M Redundancy",
IEEE Trans. on Reliability (to appear).
3.1.2 R_u: Fortran DCRU(IUN,MD,L,TAU) - "The probability of
using at most ℓ spares by time τ."

$$R_u(x,y,\ell,\tau) = \sum_{i=0}^{\ell} G(x,y,i,\tau)$$

The probability of using at most ℓ spares is clearly the
sum of the mutually exclusive probabilities that exactly i spares
are used, for i = 0, 1, ..., ℓ.
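
As an illustration of sections 3.1.1 and 3.1.2, the sketch below evaluates G and R_u for a single stage and mode. It is not the CARE2 Fortran (DCG, DCRU); the function and argument names are assumptions chosen here, K is taken to be lambda/mu, and the non-integer binomial coefficient is evaluated with the log-gamma function.

```python
import math


def prob_exactly_i_spares(i, t, Q, lam, mu, C, delta, gamma_nr=0.0):
    """G(x,y,i,t): probability of using exactly i spares by time t.

    Q        - on-line LRU's required in this mode
    lam, mu  - active and standby failure rates (mu == 0 means K = infinity)
    C, delta - coverage and delta coverage factors
    gamma_nr - rate of non-recoverable transients (assumed name)
    """
    no_transient = math.exp(-gamma_nr * t)
    tail = math.exp(-Q * lam * t)
    if mu == 0.0:
        # K = infinity: spares never fail while dormant (Poisson form).
        mean = Q * C * lam * t
        return no_transient * mean ** i / math.factorial(i) * tail
    M = (lam / mu) * Q * C / delta  # effective number of active elements
    # Generalized binomial coefficient C(M+i-1, i) for non-integer M.
    log_binom = math.lgamma(M + i) - math.lgamma(i + 1) - math.lgamma(M)
    used = math.exp(log_binom) * (delta * (1.0 - math.exp(-mu * t))) ** i
    return no_transient * used * tail


def prob_at_most_l_spares(l, t, **stage):
    """R_u(x,y,l,t): sum of the mutually exclusive G terms, i = 0..l."""
    return sum(prob_exactly_i_spares(i, t, **stage) for i in range(l + 1))
```

With perfect coverage (C = delta = 1) and no transients, the G terms form a complete probability distribution over i, which gives a convenient sanity check on any implementation.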


3.1.3 H: Fortran DCH(IUN,TAU) - "The probability density of a
degenerative failure in stage x at time τ."

$$
H(x,\tau) =
\begin{cases}
\displaystyle\sum_{i=0}^{S_x} \lambda_x Q_{x,1}\, C'_x \left(\delta'_x\right)^{i} \left(1 - e^{-\mu_x \tau}\right)^{i} G(x,y=1,S_x - i,\tau), & \text{for } K_x < \infty \\[2ex]
\lambda_x Q_{x,1}\, C'_x\, G(x,y=1,S_x,\tau), & \text{for } K_x = \infty
\end{cases}
$$

The conditional probability density of a degenerative
failure at time τ, given exactly $S_x - i$ prior failures, is equal to the
product of the probability density $\lambda_x Q_{x,1}$ of a failure in an
operational unit and the probability $(1 - e^{-\mu_x \tau})^{i}$ that the i remaining
spares have failed by time τ, times the probability $C'_x (\delta'_x)^{i}$
that it is possible to recover from this event. The product
of this conditional probability density with the probability
that exactly $S_x - i$ failures have occurred by time τ, summed over all
i = 0, 1, ..., $S_x$, is thus the desired probability density.

3.1.4 S: Fortran DCS(IUN,T,TAU) - "The conditional probability
that the set of units in stage x can remain operational
in mode y = 2, from time τ to time t, given that the
mode changed from 1 to 2 at exactly time τ."

$$
S(x,t,\tau) =
\begin{cases}
\displaystyle\sum_{\ell=0}^{S_x} \sum_{j=0}^{S_x-\ell} \binom{S_x-\ell}{j} \left(1 - e^{-\mu_x \tau}\right)^{S_x-\ell-j} e^{-j\mu_x \tau}\; G(x,y=1,\ell,\tau)\; R_u(x,y=2,j',(t-\tau)), & \text{for } K_x < \infty \\[2ex]
\displaystyle\sum_{\ell=0}^{S_x} G(x,y=1,\ell,\tau)\; R_u(x,y=2,\ell',(t-\tau)), & \text{for } K_x = \infty
\end{cases}
$$

where

$$
j' =
\begin{cases}
j, & \text{no spares reassignment} \\
j + Q_{x,y=1} - Q_{x,y=2}, & \text{reassignment allowed}
\end{cases}
\qquad
\ell' =
\begin{cases}
S_x - \ell, & \text{no spares reassignment} \\
S_x - \ell + Q_{x,y=1} - Q_{x,y=2}, & \text{reassignment allowed}
\end{cases}
$$

The product of the first three terms in the summand
is the probability that exactly j of the unused spares
are still operational at time τ. The fourth term is the
probability that exactly ℓ spares have been used in mode 1 by time τ,
and the fifth term the probability that the remaining j' spares
are sufficient to keep the unit operating in mode 2 for the next
(t-τ) seconds. The sum over all ℓ = 0, 1, ..., $S_x$ and
j = 0, 1, ..., $S_x - \ell$ is thus the probability of the event described.


3.1.5 T: Fortran DCT(IUN,J) - "The probability that the system
will survive until time t, and that a failure in the
set of units in stage x will have forced degeneration
from mode y = 1 to mode y = 2."

$$
T(x,t) =
\begin{cases}
\displaystyle\int_0^t \Bigl[\prod_{x' \neq x} S(x',t,\tau)\Bigr]\, R_u(x,y=2,r,(t-\tau))\, H(x,\tau)\, e^{-\lambda_2 \tau}\, d\tau, & \text{for } Q(x,1) > Q(x,2) \\[2ex]
0, & \text{for } Q(x,1) \le Q(x,2)
\end{cases}
$$

The integrand is the product of the probability density
of a failure in unit x at time τ (3rd term), the probability
that the unit functions successfully for the next t-τ seconds
in mode 2 (2nd term), the probability that all other units remain
in operation until time τ in mode 1 and from time τ to time t in
mode 2 (1st term), and the probability that no category-two
failures occur during this time (4th term). Since a degenerative
failure can occur any time during the interval 0 ≤ τ ≤ t, T(x,t) is
the integral of this density function over that interval.
3.1.6 T_2: Fortran DCT2(J) - "The probability that the system
will survive until time t, and that a category 2
switch failure will have forced degeneration from
mode y = 1 to mode y = 2."

$$T_2(t) = \lambda_2\, C_2 \int_0^t \Bigl[\prod_{x} S(x,t,\tau)\Bigr]\, e^{-\lambda_2 \tau}\, d\tau$$

Here the 1st term in the integral is the probability that
all units function successfully in mode 1 until time τ and in
mode 2 from time τ to time t. The product $\lambda_2 C_2 e^{-\lambda_2 \tau}$ is the
probability density of a category-two failure at time τ times the
conditional probability that recovery from such a failure is
successful. The integral of this quantity over all τ, 0 ≤ τ ≤ t,
is therefore the result sought.
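
The integral for T_2(t) generally has to be evaluated numerically. The following is a minimal sketch using the trapezoid rule; it assumes the per-stage functions S(x,t,τ) are supplied as Python callables and makes no claim about the quadrature method actually used in DCT2. All names are illustrative.

```python
import math


def t2(t, lam2, c2, stage_survival, steps=2000):
    """Trapezoid-rule estimate of T_2(t).

    lam2, c2       - category 2 switch failure rate and its coverage factor
    stage_survival - list of callables S(t, tau), one per stage
    steps          - number of integration intervals (assumed parameter)
    """
    if t <= 0.0:
        return 0.0
    h = t / steps
    total = 0.0
    for n in range(steps + 1):
        tau = n * h
        integrand = math.exp(-lam2 * tau)
        for S in stage_survival:
            integrand *= S(t, tau)
        # End points carry half weight in the trapezoid rule.
        total += (0.5 if n in (0, steps) else 1.0) * integrand
    return lam2 * c2 * total * h
```

If every stage survives with certainty (S = 1), the integral reduces to C_2 (1 - e^{-λ_2 t}), which provides a simple check.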

3.1.7 R: Fortran DCR(J,UNITR,RELM1,RSYS) - "The system reliability
at time t."

$$R(t) = \Bigl\{ \Bigl[\prod_{x} R_u(x,y=1,S_x,t)\Bigr] e^{-\lambda_2 t} + T_2(t) + \sum_{x} T(x,t) \Bigr\}\, e^{-\lambda_3 t}$$

The first term here is the product of the probability that
each of the units function until time t in mode 1 and that no
category-two failures occur during that time. The remaining
two terms represent the probabilities that the system successfully
survives until time t, with a degeneration occurring sometime
during that interval because of a category-two switch failure,
and because of a degenerative failure in one of the units,
respectively. The sum of these three terms multiplied by the
probability that no category-three failures occur during the
period in question is thus the probability that the system
operates successfully over the entire t-second interval.


The function also computes system and stage reliabilities,
assuming mode 1 operation alone, and returns the values as RELM1 and
array UNITR, respectively. The expressions for these values are
subsets of the above equation:

$$R(t,y=1) = \Bigl[\prod_{x} R_u(x,y=1,S_x,t)\Bigr]\, e^{-(\lambda_2 + \lambda_3) t}$$

$$R_x(t,y=1) = R_u(x,y=1,S_x,t)$$
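
The way the pieces of section 3.1.7 combine can be sketched as below. The stage reliabilities R_u, T_2(t) and the T(x,t) terms are assumed to have been computed already and are passed in as plain numbers; the function names are illustrative and do not correspond to the DCR argument list.

```python
import math


def mode1_reliability(stage_rel, lam2, lam3, t):
    """R(t, y=1): all stages survive in mode 1 and no category 2 or
    category 3 switch failure occurs by time t."""
    return math.prod(stage_rel) * math.exp(-(lam2 + lam3) * t)


def system_reliability(stage_rel, t2_value, t_values, lam2, lam3, t):
    """R(t): mode 1 survival, plus degeneration via a category 2 switch
    failure (t2_value), plus degeneration via a stage failure (t_values),
    all conditioned on no category 3 failure."""
    mode1 = math.prod(stage_rel) * math.exp(-lam2 * t)
    return (mode1 + t2_value + sum(t_values)) * math.exp(-lam3 * t)
```

When the degeneration terms are zero the two functions agree, which reflects the statement that the mode 1 expressions are subsets of the full equation.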

3.2 COVERAGE MODEL 

The function subprogram COVAGE, in concert with certain other 
routines, is utilized to calculate coverage for application in both 
the dual mode reliability model and original CARE equations 2 and 3. 

This section presents the equations utilized therein, as well as 
corresponding functional descriptions. 

The basic coverage calculation described herein returns a single
value C(s) corresponding to a specific system operating mode, fault type
(permanent or transient), quantity of spares tested, and fault
subclass. The calculative process is then iterated, via separate calls
to COVAGE, for each combinatorial set of conditions.

Since the ensuing reliability model accepts coverage values at
the stage rather than subclass level, it is necessary to preprocess
the above according to the relation

$$C(I) = \sum_{\sigma} d_{\sigma}\, C_{\sigma}$$

where C(I) is a coverage factor of Stage I*, and is computed as an
average of the factors returned, as C(s), by COVAGE (cf. section 3.2.2),
weighted by the fractional fault occurrence rate $d_\sigma$ for each fault
subclass σ associated with the stage. The association of stage and
fault subclass is specified by the user when selecting input values
for the linkage variable (IFSC), and the relative fault rate
$d_\sigma$ (FRAG), for each subclass used.

In turn, the required delta coverage inputs (representing the
diminished recovery probability arising from trial repair with a
previously failed spare), are computed in accordance with the
expression

$$\delta(I) = C(I,s=2)\,/\,C(I,s=1)$$

where s is the quantity of spare LRU's which must be trial tested
during the recovery process (i.e., s-1 having failed the test).
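
The two preprocessing relations above amount to a weighted average and a ratio. A minimal sketch follows; the function names are assumptions, and the check that the subclass fractions of a stage sum to one is an added assumption based on the definition of d as a fraction of the stage's faults.

```python
def stage_coverage(subclasses):
    """C(I): average of subclass coverage values weighted by the
    fractional fault occurrence rate of each subclass.

    subclasses - list of (d_sigma, C_sigma) pairs for one stage
    """
    total_fraction = sum(d for d, _ in subclasses)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("subclass fractions should sum to 1 for a stage")
    return sum(d * c for d, c in subclasses)


def delta_coverage(c_s1, c_s2):
    """delta(I) = C(I, s=2) / C(I, s=1)."""
    return c_s2 / c_s1
```

For example, a stage with two subclasses carrying 70% and 30% of the faults, with subclass coverages 0.99 and 0.90, has a stage coverage of 0.7 × 0.99 + 0.3 × 0.90 = 0.963.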


In perspective then, it should be noted that the following
coverage and delta coverage values are computed, and delivered to
the reliability model, for each stage of the object computer system:

Coverage - C(I)
  o $C_{x,y=1}$        permanent fault, continued Mode 1 operation
  o $C_{x,y=2}$        permanent fault, continued Mode 2 operation
  o $C'_{x}$           permanent fault, transitional from Mode 1 to 2
  o $P_{r_{x,y=1}}$    transient fault, continued Mode 1 operation
  o $P_{r_{x,y=2}}$    transient fault, continued Mode 2 operation

Delta Coverage - δ(I)
  o $\delta_{x,y=1}$   permanent fault, continued Mode 1 operation
  o $\delta_{x,y=2}$   permanent fault, continued Mode 2 operation
  o $\delta'_{x}$      permanent fault, transitional from Mode 1 to 2

* Note that, in the reliability model, x was used to denote a stage.
3.2.1 Subscripts and Simple Variables

t, τ, σ, ν, η - dummy variables for integration, summation and
substitution.

i, j, k, $i_\nu$ - detector number. The choice of variable is
determined as:

    i - general-purpose detector symbol (used on
        either side of an equation)

    j - scheduled detector symbol (right side of
        equation only)

    k - non-scheduled detector symbol (right side
        of equation only)

    $i_\nu$ - non-scheduled impulse detector symbol (right
        side of equation only). Specifically,
        $i_1, i_2, \ldots, i_{m_i}$ comprises the set of
        non-scheduled impulse detectors, excluding the
        ith, which have the same delay ($t_{d_i}$) as
        detector i, given $m_i$ such detectors.


3.2.2 C: Fortran COVAGE - "The sum of coverage contributions of
all competing d/i/r mechanisms given that s spares must
be checked prior to recovery."

$$C(s) = \sum_{i} C(i,s)$$

The coverage value C(s) associated with a single fault
subclass under given spares status and computer system operating
conditions is clearly the summation of individual d/i/r mechanism
contributions (C(i,s)) over all i.
3.2.3 C(i,s): Fortran COVAGE - "The coverage contribution of
the d/i/r mechanism associated with detector i,
when all competing detectors are accounted for and
when s spares must be checked during the recovery
process."

$$C(i,s) = p_i\, p'_i\, p_s^{\,s} \int_0^{\infty} g_i(\tau')\, r'_i(\tau') \int_0^{\infty} h_i(\tau'' - s\tau_s)\, r''_i(\tau' + \tau'')\, d\tau''\, d\tau'$$

The detection probability density function for the ith detector
is $p_i g_i(\tau')$ and the associated isolation density function is, by
definition, $p'_i h_i(\tau'')$. If s spares must be checked in order to
recover successfully from a fault, the overall recovery probability
is decreased by the factor $p_s^{\,s}$ (with $p_s$ the probability of
successfully checking out a spare) and the isolation delay is effectively
increased by $s\tau_s$. The term C(i,s) is thus equal to the
conditional probability that the system can still recover given a
τ'-second detection delay times the detection probability density
function, multiplied by the conditional probability that the system
can recover given that it has survived a τ'-second detection delay
and must in addition undergo a total of (τ'+τ'') seconds down-time,
times the corresponding isolation density function, the whole
thing integrated over all τ' and τ'', and multiplied by $p_s^{\,s}$.


3.2.4 g_i(τ): Fortran CVGS, CVG2 and CVG1 - "The conditional
detection rate of the ith detector when in
competition with all other detectors."

The function $g_i(\tau)$ represents the ith detector's conditional
detection rate when competitive processes are taken into account.
In turn, the corresponding detection probability density function
is then expressed as $p_i g_i(\tau)$. The rate itself is determined by
using one of three equations depending on the nature of the ith
detection process, i.e., whether it is scheduled or unscheduled
and, in the latter case, whether it is a finite or impulse detector.

• Scheduled $g_i(\tau)$: Fortran CVGS

$$
g_i(\tau) =
\begin{cases}
\displaystyle\prod_{k} \left[1 - p_k F_k(\tau - t_{d_k})\right] g'_i(\tau), & \text{for } 0 \le \tau \le n_i T_{mr} \\[1ex]
0, & \text{otherwise}
\end{cases}
$$

The product over k represents the probability that none of the
non-scheduled detectors has detected the fault by time τ (note that
$F_k(t) = 0$ for all t < 0). The function $g'_i(\tau)$ is the conditional
detection rate of the ith scheduled detector when competing with
the other operative scheduled detectors only (see Section 3.2.5).

• Finite Non-Scheduled $g_i(\tau)$: Fortran CVG2

$$g_i(\tau) = \prod_{k \neq i} \left[1 - p_k F_k(\tau - t_{d_k})\right] * \Bigl[1 - \sum_{j} p_j \int_0^{\tau} g'_j(\eta)\, d\eta\Bigr] * f_i(\tau)$$

The product over k is the probability that none of the other
non-scheduled detectors has sensed the fault by time τ. Similarly,
the second term is the probability that none of the scheduled
detectors has succeeded by time τ. The last term is, of course, the
detection rate of the detector in question in the absence of any
competition.
• Non-Scheduled Impulse $g_i(\tau)$: Fortran CVG1



g^(T) = 


m. 


Epv^---p^. 


v=l 




n -t )1 

kk d. dj^J 

k;^i , i , . . . i 
12 m . 

1 


-?»./* 


g:(Tl)dii 


* 1^0 ‘■"-■'d > ' 
i 


for f^(T) = ) 

i 


where I^q(x) is a unit impulse at x = 0 


This expression has the same interpretation as the previous
one except that the $f_i(\tau)$ is an impulse function. The first term
assumes a different form here, reflecting the fact that if ν
impulse detectors having simultaneous delays all detect the
same fault, only one of them is declared the winner. In this
event, it is assumed that each of the successful detectors has the
probability 1/ν of being declared the victor. The first term in
this expression, then, is the conditional probability that the
impulse detector in question is declared the winner over its
simultaneously occurring impulse competitors, given that it is, in fact,
successful in detecting the error. The second term is the
probability that none of the other non-scheduled detectors finds the
error prior to time $t_{d_i}$. The remaining terms are as previously
defined.
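
The tie-breaking rule described above (each of ν simultaneously successful impulse detectors wins with probability 1/ν) can be illustrated with a short calculation. This sketch is not the CVG1 code and does not reproduce the first term of the equation symbol for symbol; it assumes that the competitors detect independently with the given probabilities, and the function name is hypothetical.

```python
def win_probability(competitor_probs):
    """Probability that a successful impulse detector is declared the
    winner over competitors that share its detection delay.

    competitor_probs - detection probabilities of the other impulse
                       detectors with the same delay
    """
    # dist[k] = probability that exactly k competitors also detect.
    dist = [1.0]
    for p in competitor_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, pk in enumerate(dist):
            nxt[k] += pk * (1.0 - p)
            nxt[k + 1] += pk * p
        dist = nxt
    # With k successful competitors, nu = k + 1 detectors share the win.
    return sum(pk / (k + 1) for k, pk in enumerate(dist))
```

With no competitors the detector always wins; with one competitor that always detects, the two share the win equally.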
3.2.5 g'_i(τ): Fortran CVGP4 and CVGP3 - "The conditional detection
rate of the ith scheduled detector when in
competition with all other scheduled detectors."

The function $g'_i(\tau)$ represents the ith scheduled detector's
conditional detection rate when competitive scheduled processes
only are taken into account. It is determined using either of two
equations depending on the nature of the ith detection process,
i.e., whether it is a finite or impulse detector.

• Finite $g'_i(\tau)$: Fortran CVGP4


+ — ^ ^ — f n [l-p.F^Cn-r+n.T +t -t^)l 

n.T n n.T J ■ /{L 3 3 i mr d. j j 

1 mr c 1 mr 37^1 i 


n /n. 
c 1 


(r) =( 


for 0 ^ T ^ At . 


n /n . 
c 1 


•At. 


— ^ f n [l-p ■F'f (’7-r+n.T +t -ff)] 

"c “ "i^mr J j^i*- ^ ^ 1 mr d. : J 


* f^ (’7)d’7 


for At . < T < n .T 

1 — 1 mr 


0, otherwise 


The first term in the expression for $g'_i(\tau)$ when $0 \le \tau \le \Delta t_i$
represents the conditional detection rate of the ith scheduled
detector given that the error occurs sometime during the period
during which that detector is active (e.g., while the associated
diagnostic program is running).




Note: It is assumed here, and throughout this discussion,
that the probability that an error is detected by a given detector
during a given interval is equal to the integral of its probability
density function over that interval. Furthermore, each portion of
a scheduled test is assumed to have a probability of detecting a
particular fault only during its first exposure to that fault.
Thus, for example, if a fault occurs when a given scheduled program
of $T_0$ seconds duration is in progress and has $t_0$ seconds left to
run, and if the fault is not detected during those $t_0$ seconds, the
fault has a chance of being detected only during the first $T_0 - t_0$
seconds of that program when it is next run, and if it is not
detected then, it will not be detected by that program during any
subsequent runs.

The ℓth term in the summation in both expressions for $g'_i(\tau)$
(i.e., for $0 \le \tau \le \Delta t_i$ and for $\Delta t_i < \tau \le n_i T_{mr}$) is the conditional
detection rate of the ith scheduled detector, given that the fault
occurs either during or following the ℓth scheduling of that
detector, but is not detected during that scheduled run. Accordingly,
the product denotes the probability that none of the other
scheduled detectors exposed to the fault during the τ seconds prior
to its detection is successful in detecting it. This product,
multiplied by the detection rate function $f_i(\eta)$ associated with
the ith detector and averaged over all η gives the desired result.
Summing these conditional detection rates over a major cycle then
yields the desired detection rate for the ith scheduled detector
when competing only with other scheduled detectors. Note that
there are exactly $n_c/n_i$ repetitions of the ith detector during one
major cycle. (A major cycle is defined as the overall period of
the combined scheduled tests.)
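
The relationship between detector periods, the major cycle and the repetition count $n_c/n_i$ can be sketched as follows. The text states only that CARE2 calculates the major cycle internally; taking it as the least common multiple of the detector periods (expressed as integer multiples $n_i$ of the minor cycle) is an assumption made here, and the function name is hypothetical.

```python
from functools import reduce
from math import gcd


def cycle_counts(multiples):
    """Given each scheduled detector's period as an integer multiple n_i
    of the minor cycle, return (n_c, repetitions), where n_c is the major
    cycle in minor cycles and repetitions[i] = n_c / n_i.

    `multiples` must be a non-empty list of positive integers.
    """
    if any((not isinstance(n, int)) or n < 1 for n in multiples):
        raise ValueError("periods must be positive integer multiples")
    # Least common multiple of all periods (assumed major-cycle rule).
    n_c = reduce(lambda a, b: a * b // gcd(a, b), multiples)
    return n_c, [n_c // n for n in multiples]
```

For detectors run every 1, 2 and 3 minor cycles, the combined schedule repeats every 6 minor cycles, and the detectors run 6, 3 and 2 times per major cycle respectively.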
• Impulse $g'_i(\tau)$: Fortran CVGP3


n /n. 
c 1 


n 


g'i(0 


/La n.T 

£=1 


n 

i"^mr j^i 


~l / i 

1-p .F . i-T+n.T +t, -t . 
^3 3\ 1 mr D 


for 0 < T < n . T and f . ( ) = M^t . 

1 mr X 0 d . 

1 

0, otherwise 


This expression is simply a specialization of the previous
expression for the case in which $f_i(\tau)$ represents a scheduled
impulse detector (i.e., $f_i(\tau)$ is an impulse function).
3.3 PARAMETER DEFINITIONS AND APPLICABILITY

The quantity of parameters contained in CARE2, with reference to
the original CARE program, has of necessity increased rather
significantly. In particular, the total set now consists of:

• CARE/CARE2 compatible parameters - Variables retained 
from the original CARE program for use in CARE2 

• Dual Mode Reliability Model parameters - Variables which 
were introduced to allow implementation of Equation #7 

• Coverage Model parameters- Variables which were introduced 
to allow calculation of coverage. 

Table 3-1 lists the major parameters currently utilized in both
the models and their Fortran implementation, and denotes their
applicability with respect to both coverage and each of the seven basic
reliability forms (i.e., equations 1-6 of original CARE and equation
7 as added in CARE2). In addition, it provides a categorization of
each with respect either to its source or its use as a model output.

These same parameters are then defined, in Table 3-2, in the
form a, A(N), alpha, where:

• a - is the symbol used in analytic expressions

• A(N) - is the Fortran mnemonic, with subscripts where
  applicable

• alpha - is the parameter name.
TABLE 3-1

PARAMETER/MODEL APPLICABILITY 


Parameter 

C 

C ' 

C_ 


6 

s' 

F. iv) 

X 

1 

F^(’7,i) 

f. (r) 

1 

g^(^) 

g: (O 

h. (r) 

X 

7 

7' 

lET 

IGENC 

IGENP 

K 

La 

7 

3 X 


Coverage 

Model 


R 

R 

C 

I 

R 

R 

C 

C 

C 

I 

C 

C 

I 


s 

I 

I 


Reliability Model/ 
Equation Number 


2 

I* 


C 

I 

I 


I* 


C 

I 

I 


I* 

I* 

I 

Note 1 
I Note 1 
I* 

I* 


C 

I 


I Note 1 

I 

C 


cf. continuation sheet 2 of this table for explanatory symbols 
and notes. 





Coverage 

Parameter Model 


\3 

MD 

M 

N 


n 

c 

n . 

1 

P 


Pi 


Pi 


Pifi(-) 


p|h^(r) 


P 

r 


Q 

R 

r^(T',T") 

r^(T') 

r^(r' + T") 

RSGN 

RV 

S or r 


S 


C 

I 


I 

I 

C 

c 

R 

I 


c 

I 

I 


TABLE 3-1 (Cont.) 

Reliability Model/ 
Equation Number 


I I I I 

I - - - 


I 


I I 


R R R 


R 


R 


R 


I I I I I 

I I I I - 


2 

I 

I 

S 

I 


I* 


I 

R 


I 


I 
Parameter


TABLE 3-1 (Cont.-2) 


Coverage 

Model 


Reliability Model/ 
Equation Number 


s 


a 



At. 

1 

t, +At. 
d . 1 

X 


T. 

1 



T 

mr 


W 


S 

s 

I 

I 

c 

c 

c 

I 

I 


- Note 1 


I I I I I I 


Z 


I I I I I I 


C - computed by program
I - input by user or default table
I* - input by user, default table, or coverage model
R - result (i.e., output) of model
S - subscript used for summation, selection, etc.
- - not applicable

Note 1: These parameters are used to specify the relationship
between coverage subclasses and reliability stages for
equations 2, 3 and 7. They are included in the table for
completeness although, as a consequence, their relationship
to the column headings is somewhat strained.
TABLE 3-2

PARAMETER DEFINITIONS

Symbol          Fortran Mnemonic    Name
    Definition

$C$             C1(I), C2(I)        Coverage factor
    The conditional probability, for mode equals 1, 2 respectively,
    that the system can recover from a permanent hardware failure
    in an LRU of stage I, given that sufficient spare hardware is
    available.

$C'$            CTR(I)              Transitional coverage factor
    The conditional probability that the system can recover from a
    permanent failure in an LRU of stage I, given that no spares
    are available and hence that the system must degenerate from
    Mode M to Mode M+1.

-               CCSF                Switch failure coverage factor
    The coverage factor associated with recovery via degeneration,
    due to a category 2 switch failure.

$C_\sigma$      COVAGE              COVAGE returned value
    The coverage factor for fault subclass $\sigma$ under a given
    set of conditions, i.e., given values of MD, IET and JS in the
    COVAGE argument list.

$d_\sigma$      FRAC                -
    The fraction of class (i.e., stage) faults which are attributed
    to member subclass $\sigma$.

$\delta$        CD1(I), CD2(I)      Delta coverage factor
    A term defined by the equation $C_i = C\,\delta^{i}$, with
    $C_i$ the conditional probability, for mode equals 1, 2
    respectively, that the system can recover given that the first
    i of i+1 or more spares has failed in the idle state, that the
    next has not, and that recovery is initially attempted
    utilizing the i failed spares.
$\delta'$       CDTR(I)             Delta transitional coverage factor
    A term defined by the equation $C'_i = C'\,(\delta')^{i}$, with
    $C'_i$ the conditional probability of recovery given that all
    i of the remaining spares have failed and that all i are tested
    to ascertain this prior to degeneration.

$F_i(\tau)$     -                   Cumulative detection probability
                                    function
    The integral of the function $f_i(\tau)$, i.e., the probability
    that an ideal detector will detect an error within $\tau$ time
    units after its occurrence. Note: It is recommended that the
    user formulate the integrals for any general function models
    created, and provide these as associated integral models.


-I 

F^(r?) = 

(l-F^. (r,) ) 


Probability of non-detection 
1 ^ 

(with F . (t? ) = F . (^ ) ) . 

1 1 




f . (r) 
1 


The probability that an 
ideal scheduled detector 
(with a detection rate iden- 
tical to that of detector j) 
will detect a fault within 
V time units after the ini- 
tial delay, when detector j 

is scheduled to begin t^(i) 
time units after the occur- 
rence of a fault. 

The user specified conditional 
detection rate of detector i 
in the absence of any com- 
petitive detection processes . 




CVGS, 

CVGl, 

CVG2 


The conditional detection 
rate of the i^^ detector 
when in competition with 
all other detectors, \ 
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Fortran 

Symbol Mnemonic Name Definition . 


(t) CVGP3, 
CVGP4 


h.(T) - Isolation 

^ rate 

function 


The conditional detection 
rate of the i^^ scheduled 
detector when in competi- 
tion with all other sched- 
uled detectors. 

The user specified isolation 
rate of isolator i. 


$\gamma$        GM1(I), GM2(I)      System transient failure rate
    The product $\gamma' \cdot (1 - P_r)$ for a unit of stage I,
    and for mode equals 1, 2 respectively.

$\gamma'$       GMP(I)              Transient failure rate
    The rate at which transient hardware errors take place in
    on-line units of stage I. Expressed in failures per unit time.

-               IET                 Fault type code
    A code which specifies the "duration" of a fault. Equal to:

    - 1 for permanent faults

    - 2 for transient faults.


IGENC(I) Coverage calcu- 
lation flag 


An integer variable which 
conditionally specifies the 
source of permanent fault 
coverage factors for stage I: 

- for IGENC<0, coverage 
is calculated prior to 
each run-set 

- for IGENC=0, coverage 

is not calculated (i.e., 
either input or defaulted) 

- for IGENC>0, coverage 
is calculated prior to 
the first run-set only. 

Note: A negative coverage 
value input by the user 
serves as a higher prece- 
dence command, and forces 
calculation of a "replace- 
ment" value, given only 
that coverage itself is 
applicable to the corres- 
ponding reliability model. 
-               IGENP(I)            Transient recovery calculation
                                    flag
    An integer variable similar to IGENC, except that transient
    fault recovery probabilities ($P_r$) are the candidates for
    calculation.

$K$             K(I)                Inverse dormancy factor
    The ratio $\lambda/\mu$ for LRUs of stage I.

$L_\sigma$      IFSC                Fault class indicator
    A pointer or linkage to fault subclass $\sigma$, wherein the
    subclass accounts for a fractional portion ($d_\sigma$) of the
    faults which occur in stage $L_\sigma$ of the system.

$\lambda$       LAM(I)              On-line failure rate
    The rate at which permanent hardware failures take place in
    on-line units of stage I. Expressed in failures per unit time.

$\Sigma\lambda$ SUMLAM              Simplex on-line failure rate
    The sum of on-line failure rates over units of all stages, for
    the purpose of calculating simplex reliability.

$\lambda_2$     SLH2                Category 2 switch failure rate
    The occurrence rate of permanent hardware failures which cause
    degeneration from mode M to mode M+1. Expressed in failures
    per unit time.

$\lambda_3$     SLH3                Category 3 switch failure rate
    The occurrence rate of permanent hardware failures which cause
    total system failure. Expressed in failures per unit time.

-               MD                  Mode
    The computer system operating mode. In the Dual Mode model,
    modes are distinguished by the quota of on-line units for each
    stage. In the Coverage model, D/I/R processes are defined
    separately for modes 1 and 2, as well as dummy mode 0.

    MD is equal to:

    - 1 for full up

    - 2 for degenerate

    - 0 for transitional (coverage only, during transition from
      MD=1 to MD=2).




$\mu$           MU(I)               Stand-by failure rate
    The rate at which permanent hardware failures take place in
    stand-by LRUs (spares) of stage I. Expressed in failures per
    unit time.

$N$             N(I)                Modular redundancy factor
    The number of identical active units in a fully operational
    NMR system.

$n_c$           LCM                 -
    The least common multiple of the $n_i$'s, where $n_i$ is the
    repetition factor (where applicable) of the i-th detector, in
    terms of minor cycles.

$n_i$           IREP(I)             Repetition factor
    The integer ratio of the repetition period of the i-th
    scheduled detector to the minor cycle duration, i.e.,
    $T_i = n_i\,T_{mr}$.

$P$             P(I)                -
    The probability of stage I failing with a logical zero output.

$p_i$           -                   -
    The conditional probability that the i-th detector will detect
    an arbitrary fault, given infinite time and no competitive
    detection processes.

$p'_i$          -                   -
    The probability that the i-th isolator is able to isolate a
    fault.
$p_i f_i(\tau)$ -                   -
    The detection probability density function of detector i in a
    non-competitive environment (i.e., when no other detectors are
    operative). The value of $p_i$ and the function $f_i(\tau)$ are
    defined explicitly, as is the initial detection delay
    $t_{d_i}$. The "rate function" $f_i(\tau)$, then, represents
    the rate of detection for an ideal detector, i.e., one which
    guarantees detection, given an adequate amount of time. Since
    $t_{d_i}$ is explicit, $f_i(\tau)$ must be defined so that
    $f_i(0^{+}) \neq 0$. The duration of the function is defined as
    the value of $\tau$ after which detection cannot occur.

$p'_i h_i(\tau)$ -                  -
    The isolation probability density function of the process
    associated with detector i. Note also that $h_i(\tau)$ is a
    rate function defined similarly to $f_i(\tau)$.

$P_r$           PRC1(I), PRC2(I)    Transient recovery probability
    The probability, for mode equals 1, 2 respectively, that the
    system can recover from a transient fault occurring in an LRU
    of stage I.
$p_s$           PFDS(ISU)           -
    Probability of detecting a failure in a failed spare during
    checkout. Note that this has the effect of directly reducing
    the final coverage result, since no provision is made for
    subsequent recovery from undetected failures in spares. The
    value selected must therefore account for all secondary
    recovery capabilities of the computer system. Note also that
    the value is, for reason of its inclusion in the coverage
    model, assigned to a fault subclass rather than an LRU.

$Q$             Q1(I), Q2(I)        Quota
    The number of on-line LRUs required in stage I for system
    operation in mode equals 1, 2 respectively.

$R(t)$          R                   Reliability
    The probability that the system is operational at time t given
    that it was operational at time 0.

$r_i(\tau',\tau'')$ -               Recovery probability function
    The conditional probability of system recovery, given detection
    and isolation delays of $\tau'$ and $\tau''$, respectively:
    $r_i(\tau',\tau'') = r'_i(\tau') \cdot r''_i(\tau'+\tau'')$

$r'_i(\tau')$   -                   Fault propagation recovery
                                    function
    The conditional probability of system recovery, given detection
    time $\tau'$ for the i-th detector, at the end of which fault
    propagation ceases.
$r''_i(\tau'+\tau'')$ -             Time lost recovery function
    The conditional probability of system recovery, given total
    detection and isolation delay of ($\tau' + \tau''$) for the
    i-th detector, isolator pair, at the end of which fault
    recovery is initiated.

-               RSGN                Reassignment flag
    A logical variable which, for RSGN

    =.TRUE., enables operational LRUs in a failed channel to be
    reassigned, i.e., released to the spares pool

    =.FALSE., precludes reassignment

    (Note: RSGN must be set true if the quota in mode M+1 is
    greater than in mode M).

-               RV(I)               Restoring organ reliability
    An overall limiting probability of success applied to stage I.

$S$             S(I)                Spares
    The number of spare LRUs available in stage I at time t = 0.

-               JS                  -
    The quantity of spare LRUs which must be checked out before
    recovery can proceed (normally 0 or 1).

$\sigma$        ISU                 Fault subclass
    An integer subscript (0<ISU<8) which identifies a set of
    competitive D/I/R processes (i.e., a fault subclass). One or
    more such sets may be assigned to a single stage of the
    reliability model.
$t_{d_i}$       TDEL                Delay time
    The delay time associated with initiation of detection process
    i. For scheduled detectors, it is measured relative to the
    beginning of a major cycle; for non-scheduled detectors, it is
    the interval between occurrence of a fault and the instant it
    can first be detected.

$\Delta t_i$    TDUR                Duration time
    The time interval during which a detection function is
    non-zero.

$t_{d_i} + \Delta t_i$ -            -
    Finishing time for detector i, i.e., the time relative to the
    start of a major cycle (scheduled detectors) or the occurrence
    of a fault (non-scheduled detectors) after which detector i is
    not effective. Note:

    $\int_0^{\infty} f_i(\tau)\,d\tau = \int_0^{\Delta t_i} f_i(\tau)\,d\tau = 1.0$

    (delay is external to function)

$T_i$           -                   -
    The repetition period of the i-th scheduled detector.


tJ?(i) CYTJL(J,L,I) — Largest solution t to the 

3 equation 

t = t^+i-T^-(/-l)T^, u =0,1,2, 

. . . , in the range 

(t.,t.+T.). (If there is no 
111 
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Fortran 

Mnemonic Name Definition 



solution in this range, F^{v) 

9 ^ 

= 1. ) Thus , t . represents 


the starting time of the last 
occurrence of the scheduled 

test in the interval 




$T_{mr}$        TMINOR              Minor cycle
    The minor cycle duration, i.e., the greatest common divisor of
    the repetition periods $T_i$.

-               TFDS(ISU)           Checkout time
    The average on-line time required to test a single spare, given
    an accompanying test success probability of $p_s$. Note that
    the value is, for reason of its inclusion in the coverage
    model, assigned to a fault subclass rather than an LRU.

$W$             W(I)                Division factor
    The number of identical sub-units which comprise one unit in
    stage I. The sub-unit failure rate $\lambda_s$ thus relates to
    the unit failure rate $\lambda$ as $\lambda_s = \lambda / W$.

$Z$             Z(I)                Iteration factor
    The number of identical units operating in series which
    comprise stage I. The reliability of the stage is thus the
    product of the reliabilities of the Z sub-stages.
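
The timing entries of Table 3-2 (TMINOR, IREP and LCM) are related by simple integer arithmetic, which the following sketch illustrates. It is not taken from the CARE2 source: the Python function names and the sample periods are hypothetical, and the periods are assumed to be expressed as integers in some common time unit.

```python
from functools import reduce
from math import gcd


def minor_cycle(periods):
    """Minor cycle duration: greatest common divisor of the periods T_i."""
    return reduce(gcd, periods)


def repetition_factors(periods):
    """Repetition factors n_i, defined by T_i = n_i * T_mr."""
    t_mr = minor_cycle(periods)
    return [t // t_mr for t in periods]


def lcm_of_factors(factors):
    """n_c: least common multiple of the repetition factors n_i."""
    return reduce(lambda a, b: a * b // gcd(a, b), factors)


# Hypothetical repetition periods of three scheduled detectors.
periods = [20, 30, 50]
t_mr = minor_cycle(periods)          # minor cycle duration
n = repetition_factors(periods)      # one n_i per detector
n_c = lcm_of_factors(n)              # number of minor cycles before the
                                     # whole schedule repeats
```

Under these assumptions the complete detector schedule repeats every n_c minor cycles, which appears to be the role of the major cycle referred to in the TDEL definition.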
SECTION 4
SOFTWARE

CARE2 was developed using the CDC RUN76 compiler, under the
KRONOS 2.1 Operating System, on the CDC 6700 Time-sharing Computer
at Raytheon MSD, Bedford, Massachusetts. The program is essentially
ANSI Fortran IV, with the exception that, like original CARE, it is
designed for execution on a 60 bit CDC computer. The field length
required for the complete program version is less than 130K (octal),
and for the reduced version (without plotting options) less than 100K.

Program modules added were designed to be of comparable scope
with existing modules as far as programming efficiency and run time
considerations would permit. The statistical, mathematical or
implementation significance of each revised or added subprogram is
described in this section.

4.1 PROGRAMMING CONSIDERATIONS 

The task of including both a dual mode reliability model and
a model for the calculation of coverage factors was originally
expected to entail additions to CARE only. Ideally, the former would
replace dummy subprogram NEQ7, and the latter would require one
additional CALL statement in the main program, plus the implementing
code. However, the relatively complex method of passing statistical
parameters to the reliability models, combined with the need for
communication between stages of the dual mode model, disallowed this
ideal. Simply stated, the CARE program requires 1) that an equation
use at most one parameter of each type, and 2) that individual systems
be evaluated serially. Thus major revisions of the data base structure
were undertaken. The possibility of expanding the existing structure
to accommodate a large number of additional parameters was rejected
in favor of a simpler format.

In CARE2, a computer system is represented as a series of one
or more stages, each of which is comprised of identical subunits
including one or more optional spares. Each invocation of equations
1-6 corresponds to the definition of a single such stage, whereas
equation 7 has internal provision for representing up to eight stages.

For the group of up to 10 stages which can be modelled at one
time, there exists a base run vector (i.e., an array of dimension 10)
for each parameter type. In addition, any one of 19 parameter types
may be selected for variation (cf. sheet 2 of Table 4-2), which is
accomplished internally by alternately refreshing the selected run
vector from a single 10 x 16 iteration array and evaluating the system.
(The latter process, consisting of up to 16 evaluations or runs, is
termed a run-set.) Optionally, run-sets may also be repeated,
following user changes to run vectors, iteration data and iteration
parameter.
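
The run-set mechanism described above can be pictured with a short sketch. It is illustrative only and is written in Python rather than Fortran; the function name `run_set`, the dictionary of run vectors and the toy `evaluate` callable are all hypothetical stand-ins, not names from the CARE2 source.

```python
MAX_STAGES = 10   # dimension of each base run vector
MAX_RUNS = 16     # maximum number of evaluations in one run-set


def run_set(base_vectors, vary, iteration_array, evaluate):
    """Evaluate the system once per row of the iteration array.

    base_vectors    -- dict: parameter name -> list of MAX_STAGES values
    vary            -- name of the single parameter selected for variation
    iteration_array -- list of rows (at most MAX_RUNS), each MAX_STAGES long
    evaluate        -- stand-in for the reliability model evaluation
    """
    if len(iteration_array) > MAX_RUNS:
        raise ValueError("a run-set holds at most 16 runs")
    results = []
    for row in iteration_array:
        if len(row) != MAX_STAGES:
            raise ValueError("each iteration row must cover 10 stages")
        vectors = dict(base_vectors)   # the other run vectors are unchanged
        vectors[vary] = list(row)      # refresh the selected run vector
        results.append(evaluate(vectors))
    return results
```

A caller would supply one row per run, for example sixteen rows of failure rates to trace reliability against LAM, while every other parameter keeps its base run vector.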

Coverage factors are treated the same as other parameters when 
input directly by the user. Conversely, however, they can be computed, 
for equations to which they apply, provided that separate coverage 
input data is supplied. These inputs, because of their relative diver- 
sity, cannot be varied as described above. Most, with the exception 
of the flags which request calculation and the linkage table which 
unites the two halves of the program data base, can be changed at the 
run-set level. 

In structuring the input algorithms for CARE2, it was apparent
that both the size of the data base and the assumption that the program
would be used for sensitivity analyses indicated that an "inputs only
for changes" rule should apply throughout. Thus program defaults, where
provided, are initiated prior to user inputs (except in unaltered
portions of READIN). The default values themselves, in the case of basic
reliability model parameters, may optionally be input early in the
batch run. In addition, by setting LSTCH to true (cf. Table 5-1), the
inputs required for parameter variation can often be minimized. In
this case, non-default inputs to the iteration array PARAM will be
extended upward along the iteration dimension, replacing only default
values.

The development of CARE2 was performed using MODIFY, a system 
level program on the Bedford 6700. MODIFY facilitates the manipulation 
of source code by representing the original and subsequent versions 
as card deck images stored in disk files. Cards and decks are in- 
serted, moved or deleted, by introducing directives in groups which 
are also represented as card decks, and are saved with the source 
for future reference. 

The identifying names seen in columns 73-80 of the CARE2 source
code refer to the origin of the card. Thus cards bearing the
subprogram name, in the case of original CARE subprograms, were in fact
in the original program. The modification set name, where it appears,
in general refers to the reason for the card's replacement or addition.

For example, CAREFIX cards arose from early modifications
which were required to enable the CARE program to operate properly
on the Bedford computer, MAINLOG cards indicate basic logic
modifications, and PARMOD cards refer to the restructured parameter
data base.

4.2 SUBPROGRAM DESCRIPTIONS

This section describes, in paragraph form, each of the new and
altered subprograms, as they exist in the "complete" version of the
CARE2 program. Each is also described in flow diagram form (cf.
Appendix A) as well as in the source listing itself (cf. Appendix B).
Table 4-1 is provided in order to aid in their location.
TABLE 4-1

CROSS-REFERENCE LISTING OF CARE2 SUBPROGRAMS

Subprogram   Source Listing   Paragraph   Flowchart   Comments
Name         S/R Order #      #           #

BISECT        2               -           -           Original
*CARE2        1               4.2.1.1     A-1         Modified CARE main program
COVAGE       47               4.2.4.1     A-18        Added
COVGEN       37               4.2.2.3     A-8         Added
CVGPI        54               4.2.4.8     A-25        Added
CVGP3        51               4.2.4.5     A-22        Added
CVGP4        52               4.2.4.6     A-23        Added
CVGS         50               4.2.4.4     A-21        Added
CVG1         48               4.2.4.2     A-19        Added
CVG2         49               4.2.4.3     A-20        Added
CVTJL        53               4.2.4.7     A-24        Added
DCG          45               4.2.3.7     A-16        Added
DCH          43               4.2.3.5     A-14        Added
DCOMB        46               4.2.3.8     A-17        Added
DCR          39               4.2.3.1     A-10        Added
DCRU         44               4.2.3.6     A-15        Added
DCS          42               4.2.3.4     A-13        Added
DCT          41               4.2.3.3     A-12        Added
DCT2         40               4.2.3.2     A-11        Added
*EQUAL        3               -           -           Original
FEVAL        56               4.2.4.10    A-27        Added
FFAC          4               -           -           Original
FINTEG       57               4.2.4.11    A-28        Added
FNCK          5               -           -           Original

*Affected by array dimension modifications, as required for
optional reduced field length execution.
TABLE 4-1 (cont.)

Subprogram   Source Listing   Paragraph   Flowchart   Comments
Name         S/R Order #      #           #

FN1          61               4.2.4.15    A-32        Added
FN1I         62               4.2.4.16    A-33        Added
FN2          63               4.2.4.17    A-34        Added
FN2I         64               4.2.4.18    A-35        Added
FN3          65               4.2.4.19    A-36        Added
FN3I         66               4.2.4.20    A-37        Added
FN4          67               4.2.4.21    -           Added (dummy routine)
FN4I         68               4.2.4.22    -           Added (dummy routine)
FN5          69               4.2.4.23    -           Added (dummy routine)
FN5I         70               4.2.4.24    -           Added (dummy routine)
FN6          71               4.2.4.25    -           Added (dummy routine)
FN6I         72               4.2.4.26    -           Added (dummy routine)
IGET         58               4.2.4.12    A-29        Added
INTEGR        6               -           -           Original
IPUT         59               4.2.4.13    A-30        Added
ISHIFT       60               4.2.4.14    A-31        Added
MOVE         26               4.2.1.4     A-4         Modified to use ISHIFT
NEQ1A         9               4.2.1.2     A-2         Corrected logic error
NEQ1B        10               -           -           Original
NEQ2A         7               -           -           Original
NEQ2B         8               -           -           Original
NEQ3         11               -           -           Original
NEQ4A        12               -           -           Original
NEQ4B        13               -           -           Original
NEQ5         14               -           -           Original
NEQ6         15               -           -           Original
TABLE 4-1 (cont.-2)

Subprogram   Source Listing   Paragraph   Flowchart   Comments
Name         S/R Order #      #           #

*NEQ7        16               4.2.1.3     A-3         Replaced dummy
PARAR1       17               -           -           Original
PGET         38               4.2.2.4     A-9         Added
PLOTN        24               -           -           Original
*PLOTR       22               -           -           Original
*PLOTRV      18               -           -           Original
*PLOTT       23               -           -           Original
PROD         20               -           -           Original
PROD1        21               -           -           Original
RCOMB        19               -           -           Original
*READIN      34               4.2.1.5     A-5         Modified
READIN2      35               4.2.2.1     A-6         Added
REDUC        27               -           -           Original
RELATE       28.5             -           -           Deleted
RELEQS       30               -           -           Original
*RIFDIF      29               -           -           Original
RITE         30.5             -           -           Deleted
ROMBD        28               -           -           Original
ROWPLT       31               -           -           Original
SCAN         33.5             -           -           Deleted
SEARCH       31.5             -           -           Deleted
SIMPLE       33               -           -           Original
SIMPR1       32               -           -           Original
SPECIT       55               4.2.4.9     A-26        Added
TRANSFR      36               4.2.2.2     A-7         Added
WRNR         25               -           -           Original

*Affected by array dimension modifications, as required for
optional reduced field length execution.
4.2.1 Altered Subprograms

4.2.1.1 CARE2 (Main subprogram)

The basic order of processing is the same as that of CARE. 

All output options which were allowed in CARE exist unchanged in CARE2, 
with the exception of simultaneously varying more than one parameter 
type and automatically obtaining all possible combinations. The cur- 
rent ability to repetitively input changes to both parameter and cov- 
erage data provides an equivalent capability and, in addition, allows 
for direct user selection of the combinations to be evaluated. The 
restrictions on plotting options which applied to evaluations of a 
product of reliabilities now apply, analogously, to evaluation of a 
dual mode system (Eq. 7) with more than one stage. 

A dual mode system of up to 8 stages can be modelled in 
series with other computer models, by declaring array PROD as a like 
quantity of 7's, followed by the numbers of the other desired equations. 

As in CARE, the invariant data for a batch run is specified
via READIN. The additional data now contained in this category, as
included in Figure 5-1 and Table 3-1, consists of:

• flags which enable the calculation of coverage and the
display of intermediate coverage results

• a flag enabling the display of mode 1 reliability of the
dual mode system model and its individual stages

• a flag enabling display of the reliability model parametric
inputs, iteration array and dual mode system special
data (defaulted to .TRUE.)

• linkage data to allow calculated coverage values to
be applied to reliability models

• a flag to specify the method used to fill the iteration
array

• a debug flag (see below).

Following READIN, the variant data is input via READIN2.
(This data is termed variant because the call to READIN2 is the
beginning of a main subprogram loop which may be continued
indefinitely.) All inputs to READIN2 are made via control
cards, which determine the data to be input as well as the format to
be used. This data includes both parameters for the reliability
models and inputs to the coverage model.

After all data has been processed, the coverage driver, 
COVGEN, is called to effect all requested coverage calculations and 
transfer their result to the run vectors, using the linkage algorithm. 
The iteration array, PARAM, is then adjusted as required, and the pre- 
pared inputs to the reliability models are (optionally) displayed. 

If the DEBUG flag has been set, control passes directly back 
to READIN2, and no reliability model evaluation occurs. This is use- 
ful 1) for checking data cards for validity, and 2) when coverage 
analyses only are desired. 

If DEBUG is not set, reliability equations are evaluated 
sequentially, and reliability versus the independent variable, fol- 
lowed by the selected computational and plot options, are printed. 

In the case of a dual mode system, the independent variable must be
time and must begin at zero, due to the use of structured numerical
integration.

An additional, more specific, overview of CARE2 is provided
in Appendix A (cf. flow diagram A-1).
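
The processing order described in this paragraph can be summarized by the following sketch. It is a hypothetical Python outline rather than the Fortran main program: the five callables stand in for READIN, READIN2, COVGEN, the PARAM adjustment and the model evaluation, and ending the loop when the stand-in for READIN2 returns None is an assumption made so that the sketch terminates.

```python
def care2_main(readin, readin2, covgen, adjust_param, evaluate):
    """Outline of the CARE2 main loop as described in the text."""
    invariant = readin()                 # read once per batch run
    results = []
    while True:
        variant = readin2(invariant)     # start of the main loop
        if variant is None:              # assumed end-of-input signal
            break
        covgen(invariant, variant)       # requested coverage calculations
        adjust_param(variant)            # adjust the iteration array
        if variant.get("DEBUG", False):
            continue                     # straight back to READIN2,
                                         # no reliability evaluation
        results.append(evaluate(invariant, variant))
    return results
```

With DEBUG set, each pass validates the input cards and performs any coverage calculation but returns no reliability results, matching the two uses noted above.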
4.2.1.2 NEQ1A

A simple correction was made to this routine during the
initial investigation of CARE. Specifically, a card sequence
error was found in the algorithm's source listing which in turn
caused the summation on L (or LX) to be set equal to its last
term. (cf. flow diagram A-2)
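
The symptom described above, a summation that ends up equal to its last term, is what results when the running total is overwritten instead of accumulated. The sketch below only illustrates that symptom; the mis-sequenced cards themselves are not reproduced in this section, and both function names are hypothetical.

```python
def summation_with_error(terms):
    """Shows the symptom: the total is overwritten on every pass."""
    total = 0.0
    for term in terms:
        total = term            # the previous total is lost
    return total


def summation_corrected(terms):
    """Intended behaviour: each term is added to the running total."""
    total = 0.0
    for term in terms:
        total = total + term
    return total
```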

4.2.1.3 NEQ7

As mentioned in Section 4.1, the data available to this
subroutine via its argument list is inadequate for evaluation of
the dual mode model. The model is thus referenced separately by
the main program, with parameters held in common blocks.

NEQ7 is capable of interpolating the results stored in
array R(121,25), when referenced via RELEQS by subprograms which
perform secondary calculations (e.g., MTF). It uses a 2nd order
(parabolic) integration technique. (cf. flow diagram A-3)
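
As an illustration of second-order (parabolic) interpolation in a table of results stored at equal time steps, the sketch below fits a parabola through the three nearest tabulated points. Whether NEQ7 uses exactly this three-point form is an assumption; the function name, the argument list and the zero-based indexing are hypothetical.

```python
def interpolate_parabolic(values, step, t):
    """Interpolate values tabulated at t = step*j, j = 0, 1, 2, ...

    Uses the parabola through the three tabulated points nearest to t.
    """
    n = len(values)
    if n < 3:
        raise ValueError("at least three tabulated points are required")
    if not 0.0 <= t <= step * (n - 1):
        raise ValueError("t lies outside the tabulated range")
    j = int(round(t / step))        # nearest tabulated point
    j = min(max(j, 1), n - 2)       # keep j-1 and j+1 inside the table
    s = t / step - j                # offset from point j, in steps
    y0, y1, y2 = values[j - 1], values[j], values[j + 1]
    return y1 + 0.5 * s * (y2 - y0) + 0.5 * s * s * (y2 - 2.0 * y1 + y0)
```

The formula reproduces the tabulated values at s = -1, 0 and 1 and is exact for any quadratic function of time.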

4.2.1.4 MOVE

This routine, which transfers six-bit characters between 
words of memory, originally required a Langley installation routine 
called SETBIT. It now uses ISHIFT, which is written in COMPASS 
and included in the source code. (cf. flow diagram A-4) 
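
The kind of operation involved, moving six-bit characters between 60-bit words by shifting and masking, can be sketched as follows. This is a Python illustration under stated assumptions, not a rendering of MOVE or ISHIFT: the function names are hypothetical, and characters are assumed to be numbered from the left (most significant) end of each word.

```python
WORD_BITS = 60
CHAR_BITS = 6
CHARS_PER_WORD = WORD_BITS // CHAR_BITS   # ten characters per word
CHAR_MASK = 0o77


def get_char(word, pos):
    """Return the six-bit character at position pos (0 = leftmost)."""
    shift = WORD_BITS - CHAR_BITS * (pos + 1)
    return (word >> shift) & CHAR_MASK


def put_char(word, pos, code):
    """Return word with the character at position pos replaced by code."""
    shift = WORD_BITS - CHAR_BITS * (pos + 1)
    return (word & ~(CHAR_MASK << shift)) | ((code & CHAR_MASK) << shift)


def move_chars(src, src_pos, dst, dst_pos, count):
    """Copy count characters between word arrays; returns the new dst."""
    dst = list(dst)
    for k in range(count):
        s, d = src_pos + k, dst_pos + k
        code = get_char(src[s // CHARS_PER_WORD], s % CHARS_PER_WORD)
        w = d // CHARS_PER_WORD
        dst[w] = put_char(dst[w], d % CHARS_PER_WORD, code)
    return dst
```

Character positions are counted across word boundaries, so a move may begin in one word and finish in the next.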

4.2.1.5 READIN

The basic input package provided by READIN remains un- 
altered with the exception that the reliability model parameters 
(Q, N, LAM, etc.) have been removed from namelist $VAR. As in 
CARE, READIN is referenced once by CARE2, and thus data processed 
by it cannot be changed during later execution. 
Since CARE inputs were originally based on a time-sharing
question and answer format, they are rather awkward for use in
batch mode, and for clarity require the use of either a standardized
procedure or analysis of subprogram READIN. The substantial number
of new inputs now required to accommodate the dual mode and coverage
models complicates this matter further. In an attempt to clarify the
issue somewhat, the current input algorithm is shown in Figure 5-1,
in flowchart form.

The essential differences between CARE and CARE2 in this
context are:

• A number of logical variables have been added to Namelist
$OPTSON, including DEFCHNG and COVPRC.

• Two additional Namelists, $DEFAULT and $COVCAL, are
read following $VAR, conditional on their respective
flags, DEFCHNG and COVPRC, being set .TRUE.

• As in CARE, the inputs to READIN are invariant for the
entire batch run. The addition of subprogram READIN2
allows most parametric data to be altered between
run-sets.

• The basic reliability model parameters (Q, S, LAM, etc.)
have been moved from Namelist $VAR in READIN to Namelist
$PARVEC in READIN2.

More detailed information on the current complement of
variables and flags, as well as an overview of the revised READIN
flow and the context of how it is used in CARE2, is provided in
Sections 2.3, 4.2.1.1 and 5.1.2 and flow diagram A-5.

4.2.2 New Utility Subprograms

4.2.2.1 READIN2 (NRS, DMFLG, IRES, LVARY, DVAL, DEBUG)

This subroutine is called by CARE2 directly after READIN,
and again after completing evaluation of the reliability models
and secondary calculations. It is thus possible to continue
altering the parametric and/or coverage data base, and in turn
re-evaluating the models, indefinitely.

Due to the "inputs only for changes" rule, extensive use
is made of the Namelist feature of Fortran. Also, for this reason,
all input processing is carried out in direct response to control
cards. The routine continuously reads these cards, using a
standard format, and processes them along with other indicated data,
such as namelist inputs, until such time as a return to model
evaluation is specifically requested.

The control cards contain an identifying code in columns
1-3 which indicates one or more of the following:

• a reference to a particular Namelist follows in the
input stream

• the control card contains specific information in
a fixed format

• the subroutine is to perform a specific processing
task.

A code which begins with "I" requests the reading of a
Namelist, and those beginning with "P" cause the immediate
printing of a portion of the data base. A general description of each
of the 12 control codes is contained in Table 5-2. (cf. flow
diagram A-6)
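
The control-card convention above lends itself to a simple dispatch on the code in columns 1-3. The sketch below is illustrative only: the actual twelve codes are listed in Table 5-2 and are not reproduced here, so the codes "INL", "PRT", "XYZ" and "RUN" used in the usage note, as well as the function names, are hypothetical.

```python
def classify_control_card(card):
    """Classify a card by the identifying code in columns 1-3."""
    code = card[:3]
    if code.startswith("I"):
        return ("read_namelist", code)   # a Namelist follows in the input
    if code.startswith("P"):
        return ("print", code)           # print part of the data base
    return ("other", code)               # fixed-format data or a task


def process_cards(cards, return_code):
    """Collect actions until the card requesting model evaluation."""
    actions = []
    for card in cards:
        action = classify_control_card(card)
        if action[1] == return_code:
            break                        # return to model evaluation
        actions.append(action)
    return actions
```

For example, `process_cards(["INL", "PRT", "XYZ", "RUN", "INL"], "RUN")` yields one read, one print and one other action, and stops before the final card.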

4.2.2.2 TRANSFR(X,I)

This subroutine conditionally prints a line of parametric
input data starting at the address of X. I is the element of
array PCW(22), which contains applicability information concerning
the parameter vector to be printed. Only those values pertaining
to the applicable reliability model are printed. (cf. flow
diagram A-7)

4.2.2.3 COVGEN

If one or more of the coverage factors are to be calculated,
this routine is called by CARE2 directly after return from
READIN2. COVGEN responds to requests for coverage calculation
from both the IGENC(10) and IGENP(10) flags (which correspond to
stages of the reliability models) and to individual requests
initiated by setting coverage parameters to negative values. For
each request, the routine determines applicability by testing the
proper element of PCW(22).

Since delta coverage is computed using corresponding
basic coverage, the order of calculation is by parameter type,
with subclass linkage tests secondary. A fault subclass is
linked to a stage if the corresponding element in IFSC(8) is the
stage number. The corresponding fractional fault rate in FRAC(8)
is then applied. (cf. flow diagram A-8)
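
The request and linkage rules can be illustrated with the sketch below. The request test follows the IGENC definition in Table 3-2. The linkage function assumes that "applying" the fractional fault rate means forming a FRAC-weighted sum of the subclass coverage values; that weighting, the function names and the argument lists are assumptions, not statements about the COVGEN source.

```python
def coverage_requested(igen, run_set_number, input_value, applicable=True):
    """Decide whether a coverage value is to be calculated.

    igen < 0  : calculate prior to each run-set
    igen == 0 : do not calculate (value is input or defaulted)
    igen > 0  : calculate prior to the first run-set only
    A negative input value takes precedence over the flag, provided
    coverage is applicable to the corresponding reliability model.
    """
    if not applicable:
        return False
    if input_value < 0:
        return True
    if igen < 0:
        return True
    if igen > 0:
        return run_set_number == 1
    return False


def stage_coverage(stage, ifsc, frac, subclass_coverage):
    """Combine the coverage of every fault subclass linked to a stage.

    ifsc[k] is the stage number to which subclass k is linked and
    frac[k] is the fraction of that stage's faults in subclass k.
    """
    total = 0.0
    for k in range(len(ifsc)):
        if ifsc[k] == stage:
            total += frac[k] * subclass_coverage[k]
    return total
```

In this picture a stage with two linked subclasses receives the sum of their coverage values, each weighted by its share of the stage's faults.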
4.2.2.4 PGET(IR,I,NPROD)

PGET refreshes the run vector of the parameter to be
varied with NPROD values from row I of the iteration array
PARAM(16,10). IR indicates the parameter of interest. (cf. flow
diagram A-9)

4.2.3 Dual Mode Model Implementation Subprograms 

4.2.3.1 DCR(J,UNITR,RELM1,RSYS)

This subroutine, driven by the time step counter J,
computes the dual mode system (eq. 7) reliability RSYS at the
point in time t = STEP*(J-1). It also evaluates the system and
individual stage reliabilities as they apply to mode 1 operation
alone (RELM1, UNITR(8)). (cf. section 3.1.7 and flow diagram
A-10)

4.2.3.2 DCT2(J)

This function computes the probability that the system
will have survived to time t = STEP*(J-1), having degraded due to
a category 2 switch failure. (cf. section 3.1.6 and flow
diagram A-11)

4.2.3.3 DCT(IUN,J)

This function computes the probability that the system
will have survived to time t = STEP*(J-1), having degraded due to
a failure in stage IUN. (cf. section 3.1.5 and flow diagram
A-12)

4.2.3.4 DCS(IUN,T,TAU)

This function computes the conditional probability
that stage IUN can survive in mode 2 from time TAU to time T,
given that a degenerative failure occurred in the stage at time
TAU. (cf. section 3.1.4 and flow diagram A-13)

4.2.3.5 DCH(IUN,TAU)

This function computes the probability density of a
degenerative failure in stage IUN of the system at time TAU.
(cf. section 3.1.3 and flow diagram A-14)

4.2.3.6 DCRU(IUN,MD,L,TAU)

This function computes the probability of using at most
L spare units in stage IUN by time TAU, given the system is
operating in mode MD. (cf. section 3.1.2 and flow diagram A-15)

4.2.3.7 DCG(IUN,MD,I,T)

This function computes the probability of using exactly
I spare units in stage IUN by time T, given the system is
operating in mode MD. (cf. section 3.1.1 and flow diagram A-16)
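
By definition, the "at most L" probability returned by DCRU is the
sum of the "exactly I" probabilities returned by DCG for I = 0
through L, taken for the same stage, mode and time. The following
is a minimal Python sketch of that relationship only; it is not the
CARE2 Fortran. The function names are hypothetical, and the Poisson
law standing in for the exactly-I probability is an assumption made
purely for illustration (the actual expression is that of section
3.1.1).

```python
import math

def prob_exactly(i, rate, t):
    """Hypothetical stand-in for DCG: P(exactly i spares used by time t).
    A Poisson law is assumed here purely for illustration."""
    return math.exp(-rate * t) * (rate * t) ** i / math.factorial(i)

def prob_at_most(l, rate, t):
    """Counterpart of DCRU: P(at most l spares used by time t),
    obtained by summing the exactly-i probabilities for i = 0..l."""
    return sum(prob_exactly(i, rate, t) for i in range(l + 1))
```

Whatever expression is used for the exactly-I term, the cumulative
value must be non-decreasing in L and must not exceed 1.0.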

4.2.3.8 DCOMB(TOP,K)

This function computes the binomial coefficient of the
expression:

$$\binom{\mathrm{TOP}}{K}$$

It is equivalent to the RCOMB function except for the case when
K=0. (cf. flow diagram A-17)
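
As an illustration of the quantity involved, a binomial coefficient
can be evaluated without forming large factorials. The sketch below
is in Python, not Fortran, and the name binomial is hypothetical. It
returns the conventional value of 1 for K=0; the particular K=0
behavior that distinguishes DCOMB from RCOMB is not described here
and is not reproduced.

```python
def binomial(top, k):
    """Binomial coefficient C(top, k) by the multiplicative formula."""
    if k < 0 or k > top:
        return 0
    k = min(k, top - k)          # use the smaller of k and top-k
    result = 1
    for j in range(1, k + 1):
        # After step j, result equals C(top - k + j, j), so the
        # integer division is always exact.
        result = result * (top - k + j) // j
    return result
```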

4.2.4 Coverage Model Implementation Subprograms

4.2.4.1 COVAGE(ISU,MD,IET,JS)

This function returns a single value, either a coverage
factor or transient recovery probability, for each reference.
(Since little sharing of intermediate variables is possible in
coverage calculations, efficiency is not lost by the single
value format.) The routine is driven by COVGEN, which provides
the proper arguments for the type of coverage value required for
the reliability models.

The Fortran statement for referencing COVAGE is:

     CVAL = COVAGE(ISU,MD,IET,JS)

where

     • ISU is the fault class or subclass

     • MD is the computer system operational mode

     • IET is the major fault type, either a permanent
       or transient failure

     • JS is the number of spare units which must be
       checked out during the recovery process.


With reference to the definition of coverage (cf.
section 2.3) as a conditional probability, the arguments MD and
IET can be taken as the "given" conditions, i.e., coverage input
data may be entirely different for differing values of these
variables. In addition, a coverage factor is computed
independently for each class or subclass of faults, where a class
is defined as those faults occurring in a specific stage of the
computer model. Within the fault class, faults may arbitrarily
be divided into subclasses, provided that a decimal fraction,
specifying the relative rate of fault occurrence, is assigned
to each subclass. A subclass may be those faults occurring in a
specific part, or subunit of an LRU, or those having a certain
characteristic, such as "easy to find", etc. The coverage
factor applied to a computer stage is the average of the factors
calculated for the applicable subclasses, weighted by the
relative rates of occurrence assigned by the user.
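
The weighting just described can be sketched as follows. This is an
illustrative Python fragment, not the CARE2 code: the function name
and argument layout are hypothetical, the lists ifsc and frac merely
mirror the roles of arrays IFSC(8) and FRAC(8), and the division by
the sum of the fractions is an added safeguard for the case where
the fractions linked to a stage do not sum exactly to 1.0.

```python
def stage_coverage(stage, ifsc, frac, subclass_cov):
    """Average the subclass coverage factors linked to `stage`,
    weighted by their fractional fault rates."""
    # A subclass is linked to the stage when its IFSC entry
    # holds that stage number.
    linked = [k for k, s in enumerate(ifsc) if s == stage]
    total = sum(frac[k] for k in linked)
    if total == 0.0:
        raise ValueError("no fault subclass is linked to this stage")
    return sum(frac[k] * subclass_cov[k] for k in linked) / total
```

For example, two subclasses linked to stage 1 with fractions 0.25
and 0.75 and coverage factors 0.80 and 0.96 give a stage coverage
of 0.25*0.80 + 0.75*0.96 = 0.92.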

Each call to the COVAGE sub-program results in a calculation
of the system's conditional ability to detect, isolate (to a specific
LRU) and recover from the specified subclass of faults (ISU), under
given conditions of mode, fault type and quantity of spares checked
(as defined by MD, IET and JS respectively). To accomplish this,
it is necessary that the components of the recovery system (i.e.,
the ftp's) be described, as input to the model, in terms of their
stand-alone capacity to contribute. The model evaluation then
combines the effects of these by statistically accounting for the
competitive nature of the detectors, and the conditional isolation
and recovery success probability associated with each.

The quantity of data required by the model, due to its
flexibility, is necessarily large. The user can, however, do much
to simplify the corresponding data specification task by approaching
it systematically. A convenient (although not necessary) starting
point for this is to first list the names of all hardware and
software fault detection devices available in the computer system.
If there are 20 or fewer, a unique number is then assigned to each
name which, in turn, represents the D/I/R mechanism
corresponding to that detector. (If there are more than 20, sharing
of numbers is required.)

Figure 4-1, by way of example, represents the processes
(for all valid combinations of mode and fault type) which deal
with faults of subclass 3. Each row therein represents the portion
of a D/I/R mechanism corresponding to a single detector and each
column, the set of coexistent processes which compete for detection
of (and subsequent recovery from) faults of a particular type and
occurring during a particular mode of operation. The four numbers
contained in each intersect, designated function numbers, specify
the detection, isolation, and two recovery characteristic functions
selected to represent the components of the corresponding mechanism.
Each such number designates a fully specified time function whose
parameters are contained in one of four specification arrays,
corresponding respectively to detection (D), isolation (I),
error-propagation-recovery (E) and time-delay-recovery (T) functions
(cf. Table 5-4). Each array has sufficient room for a number of
specification lists which, in turn, are common to the entire model.

Looking again at Figure 4-1, it should be noted that the
only restrictions placed on the selection of function numbers are
that a non-zero (i.e., enabled) detector be accompanied by non-zero
function numbers for isolation and recovery, and that all 4 functions
be properly defined.

In order to generate a characteristic curve versus time
(or its integral), the coverage model uses an evaluation
sub-program which applies the specifications of the selected function
number to one of several common function models. Each model, in
turn, is a Fortran sub-program, in the form of a generalized
equation with one independent variable and up to 3 fixed
parameters, as for example:

$$f(t) = a\left(e^{-bt} + c\right)$$

The user generates specifications at run time, in part,
by linking each function number to a particular function model, and
in part, by selecting appropriate values for a, b and c. The user may
also alter, or add to, the set of function models, recompiling
only the corresponding model sub-programs. Once specification
lists have been input, the D/I/R sequences (i.e., mechanism
portions corresponding to specific conditions) can be defined,
one at a time or in groups, by selecting 4 function numbers for
each.


It should be noted that the functions selected are
characteristic curves which describe detection, isolation and
recovery capabilities under a given set of conditions. A software
detection device which is good at finding permanent failures
in a CPU register may perform poorly when the register simply
"drops a bit." The characteristic curves, and thus the function
numbers selected, will be different even though the same
detector is used.

The 4 processes comprising a D/I/R sequence must be
clearly understood, from a statistical point of view, before a
realistic recovery system can be modelled. Up to this point,
these functions have not been given a physical significance in
order to avoid confusion. Although the 4 functions are
implemented in the same manner, the detection and isolation functions
have units of probability density, whereas the 2 recovery
functions have units of probability. More specifically, the
detection and isolation functions are rate functions, which for
the present purposes means that the shape and coefficient of
each detection and isolation function is provided and utilized
separately. These rate functions (exclusive of the coefficient)
are required by the model to have an integral of 1.0, with the
coefficient then lying between 0.0 and 1.0. A normalization
routine is provided in READIN2 to assure that the rate
functions are correct.
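
The normalization requirement can be illustrated with a short
Python sketch (not the READIN2 code). It assumes that the internal
coefficient P1 enters the rate function as a simple multiplier, and
it uses a trapezoidal trial integration; the integration rule
actually used by READIN2 is not stated here. The function and
argument names are hypothetical.

```python
def normalize_p1(shape, p1, t_start, t_end, n=100):
    """Return the P1 value that makes the integral of p1*shape(t)
    over [t_start, t_end] equal to 1.0.

    `shape` is the rate function with its internal coefficient
    factored out (an assumption of this sketch)."""
    h = (t_end - t_start) / n
    # Trapezoidal trial integration over the non-zero range.
    area = 0.5 * (shape(t_start) + shape(t_end))
    area += sum(shape(t_start + i * h) for i in range(1, n))
    area *= h * p1
    if area <= 0.0:
        raise ValueError("rate function has no area to normalize")
    return p1 / area
```

For a constant shape of 1.0 over a range of 2 time units, an initial
P1 of 3.0 yields a trial integral of 6.0 and an adjusted P1 of 0.5.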

The error-propagation-recovery and time-delay-recovery
functions give the probability of successful recovery versus,
respectively, 1) the time that errors are allowed to propagate
through the system and 2) the time that is required for detection
plus isolation of a fault. The model treats these probabilities as
conditional upon each other, such that the success probability for
a given detection time and a given isolation time is the product
of the two recovery functions.
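
A minimal Python sketch of this product rule follows. It is an
illustration under stated assumptions, not the CARE2 implementation:
the function name is hypothetical, the two recovery functions are
supplied by the caller, and the error propagation time is assumed
here to equal the detection time.

```python
import math

def recovery_probability(epr, tlr, t_detect, t_isolate):
    """Success probability for given detection and isolation times,
    formed as the product of the two recovery functions."""
    # Errors are assumed to propagate until detection (assumption).
    p_prop = epr(t_detect)
    # Time lost is the detection time plus the isolation time.
    p_lost = tlr(t_detect + t_isolate)
    return p_prop * p_lost
```

With an exponentially decaying error-propagation-recovery function
and a time-delay-recovery function that drops to zero after 5 time
units, a fault detected at 1 unit and isolated 2 units later
recovers with probability exp(-1) * 1.0.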

Other data items which affect coverage are the
probability of recognizing a failure in a spare LRU, as selected to
replace a failed active unit, and the time required to test such
a spare, whether or not it has failed. (cf. section 3 and flow
diagram A-18)

4.2.4.2 CVG1(I,T)

This function computes the coefficient of the g
function for unscheduled impulse detector I, at time T units after
the occurrence of a fault. The g function is an impulse in
this case, and its coefficient indicates the competitive
effectiveness of the detector. (cf. section 3.2.4 and flow diagram
A-19)




4.2.4.3 CVG2(I,T)

This function computes the value of the g function for
unscheduled finite detector I, at time T units after the occurrence
of a fault. The g function is a measure of the competitive
effectiveness of the detector. (cf. section 3.2.4 and flow diagram A-20)

4.2.4.4 CVGS(I,T)

This function computes the value of the g function for
scheduled detector I, at time T units into the period of the test
(detector). The detector may be either an impulse or finite function,
and the calculation is based in part on the sample g' values saved in
array GPAR(20,101). The g function is a measure of the competitive
effectiveness of the detector. (cf. section 3.2.4 and flow
diagram A-21)

4.2.4.5 CVGP3(I)

This subroutine computes and stores N+1 samples of the
g' function for scheduled impulse detector I, over its period
(i.e., the time between successive runnings of the test). The
assumption is made that no detection can be made after one such
period, measured from the occurrence of a fault. The g' function
is a measure of the effectiveness of the test, in competition
with other scheduled detectors. (cf. section 3.2.5 and flow
diagram A-22)

4.2.4.6 CVGP4(I)

This subroutine computes and stores N+1 samples of the
g' function for scheduled finite detector I, over its period
(i.e., the time between successive runnings of the test). The
assumption is made that no detection can be made after one
such period, measured from the occurrence of a fault. The g'
function is a measure of the effectiveness of the test, in
competition with other scheduled detectors. (cf. section 3.2.5
and flow diagram A-23)

4.2.4.7 CVTJL(J,L,I)

0 

This function returns the time difference value tr(i) 
for use by CVGP3 and CVGP4 in evaluating the competition between 
scheduled detectors, i.e., the effect of detector J on detector 
of interest I. (cf. flow diagram A-24) 

4.2.4.8 CVGPI(I,T)

This function interpolates the sample g' values (for 
detector I) which have been saved in array GPAR(20,101) by 
CVGP3 and CVGP4. The result is used by CVGS in determining the 
effectiveness of scheduled detectors in competition. (cf. flow 
diagram A-25) 
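
The interpolation step can be pictured with the following Python
sketch, which is illustrative only. It assumes linear interpolation
between equally spaced samples spanning one detector period; the
interpolation order actually used by CVGPI is not stated here. The
function name is hypothetical. A row of GPAR(20,101) would supply
101 samples, i.e., N = 100.

```python
def interpolate_samples(samples, period, t):
    """Linearly interpolate N+1 equally spaced samples spanning
    the interval [0, period]."""
    n = len(samples) - 1
    if t <= 0.0:
        return samples[0]
    if t >= period:
        return samples[n]
    pos = t / period * n            # fractional sample index
    i = min(int(pos), n - 1)        # guard against rounding up to n
    frac = pos - i
    return samples[i] + frac * (samples[i + 1] - samples[i])
```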

4.2.4.9 SPECIT(FLIST,NUM,IGFT,ISCH,IREP,INTF,COEF,TDEL,P1,P2,P3,TDUR)

This subroutine sets up a specification list (in one
of four arrays) which defines a characteristic curve (function)
versus time for either a detection, isolation,
error-propagation-recovery or time-lost-recovery process. The
respective specification arrays which receive this data are
FDET(7,200), FISO(7,50), FEPR(7,25), or FTLR(7,25). FLIST is the
address of one of these arrays, and NUM the column position within
it where the data will be stored. The remaining arguments are the
function specifications:
     • IGFT      The identification number of the function model
                 to be employed in evaluating the curve

     • ISCH      The scheduling indicator. Equal to 0 for
                 unscheduled, or 1 for scheduled detectors.
                 Does not apply to other processes

     • IREP      The repetition factor of scheduled detectors
                 only, i.e., the number of minor cycles in the
                 detector period

     • INTF      The integral-defined indicator. Equal to 0 if
                 the integral model corresponding to IGFT above
                 does not exist, and 1 if it does

     • COEF      The explicit coefficient of the function. For
                 detection and isolation, it is also the
                 infinite-time success probability

     • TDEL      The delay time associated with a detector or
                 isolator. For detectors, it is measured either
                 from the occurrence of a fault (unscheduled), or
                 from the beginning of the test period (scheduled)

     • P1,P2,P3  Arbitrary parameters which are passed to
                 function model number IGFT via COMMON block CVB4,
                 and used for the evaluation of the process
                 characteristic. If normalization is to be
                 performed using the routine in READIN2, P1 must
                 be an internal coefficient in the function model
                 selected to represent either a detection or
                 isolation rate function

     • TDUR      The duration of any finite process, including
                 recovery. This is used to determine upper limits
                 for numerical integrations, and for
                 reasonableness sampling when requested.

(cf. flow diagram A-26)
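
For readers more familiar with record types than with packed Fortran
arrays, one specification list can be pictured as the following
Python record. This is a sketch only: the class name, the default
values and the list-based storage are assumptions, and the way
CARE2 lays the ten items out in a 7-row array column is not
reproduced.

```python
from dataclasses import dataclass

@dataclass
class FunctionSpec:
    igft: int          # function model identification number
    isch: int = 0      # 0 = unscheduled, 1 = scheduled (detectors only)
    irep: int = 0      # minor cycles per detector period (scheduled only)
    intf: int = 0      # 1 if an integral model exists for igft, else 0
    coef: float = 1.0  # explicit coefficient of the function
    tdel: float = 0.0  # delay time of the detector or isolator
    p1: float = 0.0    # parameters handed to the function model
    p2: float = 0.0
    p3: float = 0.0
    tdur: float = 0.0  # duration of the (finite) process

def specit(flist, num, spec):
    """Store `spec` at 1-based column `num` of specification list `flist`."""
    if not 1 <= num <= len(flist):
        raise IndexError("column position outside the specification array")
    flist[num - 1] = spec
```

A detection array with room for 200 specification lists, as in
FDET(7,200), would then be created as fdet = [None] * 200 and filled
with calls such as specit(fdet, 3, FunctionSpec(igft=3, coef=0.9)).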

4.2.4.10 FEVAL(FLIST,NUM,T) 

This function evaluates the characteristic function
representing a detection, isolation, error-propagation-recovery,
or time-lost-recovery process. The function specifications are
retrieved from the array position indicated by FLIST (which may
stand for FDET, FISO, FEPR, or FTLR) and NUM (the column position
within the array). The independent variable T is passed to the
function model. (cf. flow diagram A-27)

4.2.4.11 FINTEG(FLIST,NUM,T)

This function evaluates the time integral of the
characteristic function representing detection (although it
could be used for another process). The function specifications
are retrieved as in FEVAL, and the integral-defined indicator is
tested. If it is greater than 0, the independent variable is
passed to the proper integral model. Otherwise, a numerical
integration (Simpson 3 point) is performed using N+1 samples of
the function, which are returned by FEVAL. N is a local variable
which may be altered by the user, via a recompilation of FINTEG
alone. (cf. flow diagram A-28)
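
The selection between a closed-form integral model and a Simpson
integration over N+1 samples can be sketched in Python as follows.
This is an illustration, not the FINTEG source: the function names
are hypothetical, the lower integration limit of 0 is an assumption,
and N is required to be even, as the composite Simpson rule demands.

```python
def simpson(f, a, b, n=100):
    """Composite Simpson integration of f over [a, b] using n+1
    samples. n must be even."""
    if n < 2 or n % 2:
        raise ValueError("n must be a positive even integer")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Odd interior samples carry weight 4, even ones weight 2.
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def finteg(f, t, integral_model=None, n=100):
    """Use the integral model when one is defined, else integrate
    numerically from 0 to t (lower limit assumed)."""
    if integral_model is not None:
        return integral_model(t)
    return simpson(f, 0.0, t, n)
```

Because Simpson's rule is exact for polynomials up to third degree,
integrating t*t from 0 to 3 with only 10 intervals returns 9.0 to
within rounding error.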
4.2.4.12 IGET(IWORD,INDEX)

This function returns the INDEX'ed 12 bit field of
IWORD (one of 5 in the 60 bit word), as a right justified
integer. (cf. flow diagram A-29)

4.2.4.13 INPUT(IWORD,INDEX,ICODE)

This subroutine packs the rightmost 12 bits of ICODE
into the INDEX'ed 12 bit field of IWORD, after first clearing
the field. (cf. flow diagram A-30)

4.2.4.14 ISHIFT(IWORD,NBITS)

This COMPASS function returns, as an integer, the 
result of performing a left circular shift of NBITS on IWORD. 
(cf. flow diagram A-31) 
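
The three packing utilities (IGET, INPUT and ISHIFT) can be
illustrated together with a Python sketch of 12 bit fields in a 60
bit word. The names put_field, iget and ishift are used for
illustration only, and the field numbering (field 1 taken as the
rightmost 12 bits) is an assumption; the ordering used by CARE2 is
not stated here. Unlike the Fortran subroutine INPUT, put_field
returns the new word rather than altering its argument.

```python
WORD_BITS = 60
FIELD_BITS = 12
WORD_MASK = (1 << WORD_BITS) - 1
FIELD_MASK = (1 << FIELD_BITS) - 1

def ishift(word, nbits):
    """Left circular shift of a 60 bit word by nbits."""
    nbits %= WORD_BITS
    word &= WORD_MASK
    return ((word << nbits) | (word >> (WORD_BITS - nbits))) & WORD_MASK

def iget(word, index):
    """Return 12 bit field `index` (1..5) as a right justified integer.
    Field 1 is taken to be the rightmost field (an assumption)."""
    return (word >> (FIELD_BITS * (index - 1))) & FIELD_MASK

def put_field(word, index, code):
    """Clear field `index` of word, then pack the rightmost 12 bits
    of code into it."""
    shift = FIELD_BITS * (index - 1)
    cleared = word & ~(FIELD_MASK << shift) & WORD_MASK
    return cleared | ((code & FIELD_MASK) << shift)
```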

4.2.4.15-16 FN1(T) and FN1I(T)

This function model and corresponding integral model
represent the characteristics of a "single pulse" or constant
amplitude function. General parameter P1 is the amplitude, and
the only parameter used in this case. It is important to note
that the model does not incorporate a cutoff of its own, but is
defined simply as FN1=P1. The coverage model applies the TDEL
and TDUR values in its evaluation, in such a way as to properly
service the numerical integral routines. (cf. flow diagrams
A-32 and A-33)

4.2.4.17-18 FN2(T) and FN2I(T)

This function model and corresponding integral model
represent the characteristics of a "pulse train" function, with
amplitude P1, pulse width P2, and pulse repetition period P3.
Due to the discontinuities inherent in this function, it is not
recommended for general use, except on an experimental basis.
(cf. flow diagrams A-34 and A-35)

4.2.4.19-20 FN3(T) and FN3I(T)

This function model and corresponding integral model
represent the characteristics of an exponential function. The
equation is:

$$\mathrm{FN3}(T) = P1\left(e^{-P2 \cdot T} + P3\right)$$

(cf. flow diagrams A-36 and A-37)
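
A Python rendering of the exponential model and of a matching
integral model is sketched below for illustration. In CARE2 the
parameters reach the model through COMMON block CVB4; here they are
passed explicitly. The lower integration limit of 0 in fn3i is an
assumption, since the limits used by FN3I are not stated here.

```python
import math

def fn3(t, p1, p2, p3):
    """Exponential function model: P1 * (exp(-P2*t) + P3)."""
    return p1 * (math.exp(-p2 * t) + p3)

def fn3i(t, p1, p2, p3):
    """Integral of fn3 from 0 to t (lower limit of 0 is assumed)."""
    if p2 == 0.0:
        # exp(0) = 1, so the integrand is the constant P1*(1 + P3).
        return p1 * (1.0 + p3) * t
    return p1 * ((1.0 - math.exp(-p2 * t)) / p2 + p3 * t)
```

With P3 = 0 and P1 = P2, the integral tends to 1.0 as t grows, which
is the form required of a normalized detection or isolation rate
function.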

4.2.4.21-26 FN4(T), FN4I(T), FN5(T), FN5I(T), FN6(T) and FN6I(T)

These dummy functions are provided to allow the user to
expand the inventory of models, as more complex systems for
coverage modelling are evolved.

4.3 MAJOR ARRAYS AND COVERAGE DATA BASE 

The following sections describe both the new format of
the reliability model parameter arrays, and the changes in
dimension needed to run the reduced field length version of CARE2.

4.3.1 Reliability Parameters 

As mentioned previously, the original 16 x 10 parameter
arrays proved too cumbersome for use with the dual mode model,
due to the complex method employed for extracting individual
elements. The new version of this data base consists of a set
of 19 independent and 3 dependent parameter vectors, of
dimension 10. Each vector holds the inputs, for up to 10 computer
stages, for one parameter (e.g., Q, LAM, etc.).

A parameter control array, PCW(22), contains
applicability and other information, in packed format, with one
element of the array corresponding to each parameter. The first
part of Table 4-2 shows the fields within a word, and the
second part of the table gives the parameter order and association
of parameters with default variables.

4.3.2 Reduced Field Length CARE2

In the course of debugging the program with the new
models included, the dimensions of certain of the plotting
arrays not required for reliability computation were temporarily
reduced in size. The result was a 30K (octal) savings in required
field length for execution. The user who wishes to obtain only
tables of reliability can convert the complete program version
by making the changes shown in Table 4-3.
TABLE 4-2
PCW AND DEFAULTS

-ARRAY PCW(22)-

PCW contains packed information pertaining to the type and
usage of each parameter in the reliability model. The fields
within each word (one per parameter) are as follows:

Field   Bit position(s)   Meaning if bit position is equal to one

A       59                Parameter is represented in real format
                          (as is default), otherwise it is in
                          integer format

B_i     58 thru 52        Parameter is used in the i-th reliability
                          equation, and B_i is located at (59-i)

C_i     39 thru 30        Parameter is used in the reliability
                          equation associated with the i-th stage,
                          and C_i is located at (40-i). (These bits
                          are set by READIN for use by COVGEN &
                          RITE2)

        4 thru 0          5 bit right justified integer giving the
                          subscript of the default value in either
                          INTDFS(4) or RLDFS(10), depending on
                          field A (bit 59).
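
The field layout above can be exercised with a short Python sketch
that unpacks one PCW element. It is illustrative only: the function
name and the dictionary keys are hypothetical, while the bit
positions are those given in the table (bit 59 for field A, bit 59-i
for equation i, bit 40-i for stage i, and the low-order 5 bits for
the default subscript).

```python
def decode_pcw(word):
    """Unpack one 60 bit PCW element into its fields."""
    def bit(pos):
        return (word >> pos) & 1

    return {
        # Field A: 1 = real format, 0 = integer format.
        "is_real": bool(bit(59)),
        # B_i at bit 59-i: equations 1 thru 7 that use the parameter.
        "equations": [i for i in range(1, 8) if bit(59 - i)],
        # C_i at bit 40-i: stages 1 thru 10 whose equation uses it.
        "stages": [i for i in range(1, 11) if bit(40 - i)],
        # Low-order 5 bits: subscript into INTDFS or RLDFS.
        "default_index": word & 0b11111,
    }
```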
TABLE 4-2 (Cont.)

-Parameter Names, Defaults and Usage-

PCW                          ARRAY        DEFAULT  EQUATION
SUBSCRIPT  NAME    DEFAULT   POSITION     VALUE    1  2  3  4  5  6  7

1          Q=Q1    Q1DEFr    RLDFS(1)     2.0         X  X           X
2          Q2      Q2DEFr    RLDFS(2)     1.0                        X
3          N       NDEFi     INTDFS(1)    3        X
4          S       SDEFi     INTDFS(2)    0        X  X  X  X        X
5          W       WDEFi     INTDFS(3)    1        X  X  X  X  X  X
6          Z       ZDEFi     INTDFS(4)    1        X  X  X  X  X  X
7          LAM     LAMDEFr   RLDFS(3)     10^-6    X  X  X  X  X  X  X
8          MU      MUDEFr    RLDFS(4)     10^-6    X  X  X  X        X
9          GMP     GMPDEFr   RLDFS(5)     10^-6                      X
10         C=C1    CDEFr     RLDFS(6)     1.0         X  X           X
11         CD1     CDEFr     RLDFS(6)     1.0                        X
12         C2      "         "            "                          X
13         CD2     "         "            "                          X
14         CTR     CTRDEFr   RLDFS(7)     "                          X
15         CDTR    "         "            "                          X
16         PRC1    PRCDEFr   RLDFS(8)     "                          X
17         PRC2    "         "            "                          X
18         RV      RVDEFr    RLDFS(9)     "        X  X  X  X  X
19         P       PDEFr     RLDFS(10)    0.5                  X
20         K                 "            N/A      X  X  X  X        X
21         GM1               "            N/A                        X
22         GM2               "            N/A                        X

Note:

- other parameters (scalars) are printed when DCFLG = .TRUE.

- internal representation is denoted by subscript r for
  real and i for integer.
TABLE 4-3

DIMENSION CHANGES FOR REDUCED FIELD LENGTH CARE2

Subroutines affected:

     * CARE2 (main subprogram)
     * EQUAL
     * NEQ7
     * PLOTR
     * PLOTRV
     * PLOTT
     * READIN
     * RIFDIF

Arrays affected:

     Complete version     Reduced version

     R(121,25)            R(121,10)
     DIFF(121,15)         DIFF(1,1)
     RIF(121,15)          RIF(1,1)
     GAIN(121,15)         GAIN(1,1)
     ABSC(121,3)          ABSC(1,1)
     G(121,16)            G(1,1)
     XY(121,19)           XY(1,4)
     RLRV(210,17)         RLRV(1,2)
     FDUM(3233)           FDUM(1)
SECTION 5
PROGRAM OPERATION

CARE2 is designed to run on a 60 bit CDC 6000 Series
computer system, using the RUN Fortran compiler in batch mode,
under either the KRONOS 2.1 or SCOPE 3.0 operating system.
Source and data input are nominally provided on punched cards,
and output is nominally produced on a line printer. However,
both operating systems allow reassignment of input/output
devices with little effort.

In modifying and adding to the Fortran source code,
care was taken to avoid using features of extended Fortran
versions which were not used in the original CARE program.
Reliance on library routines was also held to a minimum. The
SETBIT routine is no longer required, and the ISHIFT function
is included in the source as a Fortran callable COMPASS program.
Other library references are to long-time standards such as
AMIN1, OR, etc.

In comparison with CARE, which requires about 110K
octal field length to load and execute, CARE2 runs in slightly
under 130K (the precise figures depend somewhat on the
compiler and operating system used). The reduced version of CARE2
runs in about 100K.

5.1 USER'S GUIDE 

This section, although intended for those familiar
with the operation of CARE, provides the information required
for compiling all the input data to model computer systems and
coverage systems with CARE2. An input algorithm, in flowchart
form, is included to simplify usage. The use of the Fortran
Namelist feature and control cards on input facilitates sensitivity
analyses, which are an expected mode of operation.

5.1.1 Processing Order 

The general processing order of the program as a
whole is as follows:

1) Input the computer configuration to be modelled,
   using one or more equation numbers to represent
   stages in series or, in the case of the dual mode
   model, a number of 7's to represent the stages
   within that model.

2) Input the upper limit of the independent variable
   and the step size to be used to generate
   reliability tables.

3) Input data to specify the desired computational
   and plotting options, changes to defaults
   (optional), and linkage data for the subsequent
   transfer of calculated coverage values to the
   reliability model data base.

4) Input non-default values for parameter base run
   vectors, coverage function selection and
   specification data, and special dual mode and coverage
   model variable data, as required.

5) Input iteration data, if any. (Enables the
   rerunning of computer model evaluations with
   variations in a single parameter.)

6) Print out selected parts of the coverage data base,
   if requested.

7) Compute any requested (and applicable) coverage
   factors, optionally displaying intermediate
   results (conditional D/I/R mechanism contributions).

8) Optionally print out parametric data to be used
   in reliability model evaluations, including
   calculated coverage and special variables.

9) Compute and print tables of reliability versus the
   selected independent variable, sequentially for
   each equation in the configuration list. If the
   dual mode model (equation 7) is evaluated, optionally
   print the mode 1 system and stage reliabilities
   versus time (ahead of the standard reliability
   table).

10) Compute product, MTF and other selected options,
    including plots.

11) Repeat steps 4 through 10, changing only the
    desired variant data. (Note that the iteration
    parameter and data will remain the same if not
    actively altered or defeated.)

5.1.2 Use of the Input Algorithm 

Figure 5-1 is a working guide for the creation of an 
input deck. The requirements are unusually complex, partly 
due to the volume and variety of data, and partly since the input 
routine was originally written in a question-and-answer format 
for use in a time-sharing environment. 

Each input record must of course contain the proper
information in the proper format. Thus the algorithm, in flow
chart form, covers all possible inputs in both READIN (invariant
data) and READIN2 (variant data). It is not intended as a complete
flow diagram for either of these subroutines.

In following the diagram, note that a decision block
either contains the name of a logical variable or shows some
test of a variable. Input blocks which refer to Namelist names
(designated by $) have comments showing the names and dimensions
of their contained variables.

Table 5-1 describes many of these variables in operational
terms. The 12 possible control cards which initiate processing
tasks in READIN2 are described in Table 5-2. Input blocks which
show fixed format cards refer to format entries in Table 5-3.
FIGURE 5-1

INPUT ALGORITHM
TABLE 5-1

DEFINITIONS OF LOGICAL AND CONTROL VARIABLES

Logical: Specifies whether a list of equation numbers
         is desired rather than a single equation.

Control: Specifies the single equation to be evaluated.

Logical: Specifies (if TRUE) that input format errors
         detected in READIN2 will not cause the job to abort,
         and that the reliability models will not be exercised.

Logical: Specifies (if TRUE) that a table of parameter
         values to be used for the subsequent run-set will be
         printed (regardless of the setting of DEBUG).

Logical: Specifies (if TRUE) that mode 1 reliability
         results for the dual mode model are to be printed.

Logical: Specifies (if TRUE) that individual
         mechanism contributions are to be printed during the
         calculation of coverage.

Logical: Determines the convention used in the
         preprocessing of parameter iteration data in array PARAM.
         If TRUE, non-default elements replace default elements
         along the iteration dimension (those of increasing
         row subscript are replaced). For example, if the
         parameter default is 0, and a 1 is placed in PARAM(1,3),
         0's in positions (2,3), (3,3), etc. will be
         replaced by 1's, up to the next non-default.

Logical: Specifies (if TRUE) that some coverage factors
         will be calculated (under the condition that subsequent
         requirements are met).

Logical: Specifies (if TRUE) that some default values
         for parameter vector initialization are to be changed.

Logical: Specifies (if TRUE) that the reliability
         product table is to be plotted on the line printer.

Control: The list of equation numbers, used to form a
         product of reliabilities, to specify the number of
         stages in the dual mode model, or both.

Control: The upper limit for reliability tables when
         the independent variable is time. Time must be used
         when the dual mode model is evaluated.

Control: The upper limit for reliability tables when
         the independent variable is failure rate * time.

Control: The upper limit for reliability tables when
         the independent variable is the exponential of (failure
         rate * time).

Control: The lower limit of the independent variable.
         This must be 0.0 when the dual mode model is evaluated.

Control: The increment of the independent variable.

Control: The 1 or 2 letter code beginning in column
         1 of a READIN2 control card, which identifies an input
         or processing request.

OPTR2    See original CARE documentation.
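
The iteration-data convention described in Table 5-1 (non-default
elements replacing the default elements that follow them along the
iteration dimension) amounts to a forward fill. The Python sketch
below is illustrative only; the function name is hypothetical, and
one slice of array PARAM along the iteration dimension is
represented as a plain list.

```python
def fill_forward(column, default):
    """Replace default entries with the nearest preceding
    non-default entry, if there is one."""
    filled = []
    last = default
    for value in column:
        if value != default:
            last = value          # a new non-default takes over
        filled.append(last)
    return filled
```

With a default of 0, the entries 1, 0, 0, 2, 0 become 1, 1, 1, 2, 2,
which matches the example given in the table. Default entries that
precede the first non-default are left unchanged.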
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Identifie r 

IB 
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TABLE 5-2 
CONTROL CARDS 

Control Function 

Input the base-run parameter vectors (for reliability 
models) via namelist PARVEC 

Input system data via namelist DATA, specifically; 
SLH2, SLH3, CCSF, RSGN, PFDS, TFDS, TMINOR, and NINT. 
Input the variation parameter type code, followed by 
namelist VARY with changed values for array PARAM, 
Read a subset of the D/I/R mechanism data base 
(function number arrays) . One function number is 
specified for one process (detection, isolation, 
recovery (1 of 2)), but the number may be distributed 
over any or all of 4 dimensions: 

• subclass of faults (1 thru 8) 

-/ • mode (0, 1 or 2) 

• error type (1 = permanent, 2 = transient) 
0 mechanism ( 1 thru 20) 

Begin reading D/I/R mechanism definitions (function 
number selections) at the rate of one mechanism (for 
one fault subclass) per card. The selections must 
completely define the characteristics of the 4 pro- 
cesses comprising the mechanism, for all conditions 
of mode and error type. 

Read one function specification for one process 
(detection, isolation, recovery) of the recovery 
system. Function number and specifications use a 
common fixed format (based on the scheduled detection 
function) , but fields which do not apply may be left 
blank'. 
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TABLE 5-2 (Cont.) 

Identifier   Control Function 

E  - Read an explicitly defined subset of the D/I/R 
     mechanism data base. This is a combination of the C 
     and F control cards, in which the function number 
     selection is automatic. The formats for the 
     auxiliary information (on 2 cards) are identical to 
     those of the C and F cards. 

N  - Normalize the function specifications. The normalization 
     involves all defined functions for all 4 processes. 
     A trial integration is performed for detection and 
     isolation rate functions over their specified non-zero 
     ranges. The respective P1 values are then adjusted 
     to obtain an integral value of 1.0 (P1 is 
     assumed to be an internal coefficient of each function). 
     The recovery probability functions are sampled over 
     their specified ranges, and must be ≤ 1.0 at each 
     sample point. Finally, the explicit coefficients for 
     all functions (each of 4 specification arrays) are 
     tested, and must be ≤ 1.0. Appropriate messages are 
     printed during or after normalization. 

PC - Display the contents of the D/I/R mechanism data base. 
     The function numbers selected to define recovery 
     subsystems are printed, with conditions which may cause 
     errors flagged with the letter X. 

PF - Display the specifications of all defined functions. 

G  - Exit READIN2 and execute the run-set. Coverage and 
     reliability models will be exercised if the DEBUG flag 
     is not set. Otherwise, the requested coverage 
     calculations are performed and the selected run-set input 
     data is printed (including coverage and iteration 
     values), prior to re-entering READIN2. 

S  - Stop the program (optional, equivalent to an end-of-file 
     on input). 
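
The normalization performed by the N control card can be illustrated with a short sketch. It is written in Python rather than in the program's own language, and it assumes that the decaying exponential model has the form P1*exp(-P2*t) over a non-zero range of length TDUR; that form, and the names simpson and normalize_p1, are assumptions made for illustration and are not taken from the CARE2 source. With P2 = 0.7 and TDUR = 5.0, the trial integration gives a P1 close to the .72179 that can be read in the function specification cards of the Figure 6-5 listing. 

```python
import math

def simpson(f, a, b, n_segments=100):
    """Composite Simpson's rule using n_segments 3-point parabolic segments."""
    h = (b - a) / (2 * n_segments)          # spacing between sample points
    total = f(a) + f(b)
    for k in range(1, 2 * n_segments):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0

def normalize_p1(p2, tdur, n_segments=100):
    """Return P1 so that P1*exp(-P2*t) integrates to 1.0 over [0, TDUR].

    The exponential form is an assumed function model, used for illustration.
    """
    # Trial integration with P1 = 1, then scale P1 by the reciprocal.
    trial = simpson(lambda t: math.exp(-p2 * t), 0.0, tdur, n_segments)
    return 1.0 / trial

p1 = normalize_p1(p2=0.7, tdur=5.0)
print(round(p1, 4))   # about 0.7218
```

The explicit coefficient (COEF) is left out of the integral on purpose: the rate function is normalized to 1.0 first, and COEF then scales it to the desired overall probability. 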
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5.1.3 Specific Input Information 

Inputs for the dual mode model are described in Sections 2.3 
and 3.3, and are relatively easy to comprehend. Because of its 
greater flexibility, the coverage model has considerably more 
complex input requirements, and additional learning aids may 
therefore prove useful. 

In particular, it is felt that preliminary experimentation, by 
the user, may be required in order to become sufficiently familiar 
with the tool to use it effectively. For this reason, a set of input 
values is provided (cf. Tables 5-4 and 5-5) which reflect the coverage 
system described in Section 2.2. 

Table 5-4 gives specifications for the above referenced detection, 
isolation and recovery characteristics, which, in turn, may then be 
applied to measure the effects of these processes on different subclasses 
of faults, under various conditions, e.g., permanent versus transient 
faults, mode 1 versus mode 2 operation, etc. Table 5-6 defines the 
individual variables which comprise the specification of a single 
function (cf. Format 4F in Table 5-3 for individual specification 
requirements of the 4 processes), as well as the other important 
variables in the coverage model. 

Table 5-5 gives the function selection data for a complete 
system of fault coverage. For example, the 4 numbers in the upper 
right hand corner of the table (under the heading "Permanent Fault/ 
Mode 0") indicate the non-competitive (i.e., stand-alone) effectiveness 
of detector #2 and its associated isolation and recovery 
processes, under a particular set of specified conditions (i.e., 
that a permanent failure has occurred in the stage linked with fault 
subclass 1, and that the recovery is predicated on degenerating 
from mode 1 to mode 2 operation (shown as mode 0)). 
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TABLE 5-4 


FUNCTION SPECIFICATIONS 



Note: column heading mnemonics are defined on the second sheet 

of Table 5-6. 
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TABLE 5-5 

FUNCTION SELECTIONS - BY FAULT SUBCLASS 

[Table body not legible in this copy. Rows: mechanism number, grouped 
by fault subclass (Subclass 1 thru 5). Column groups: Permanent, 
Permanent, Transient, Transient, and Permanent Fault/Mode 0, each 
giving the D, I, E and T function numbers.] 

*0 represents a null contribution to coverage. Accordingly, when 
setting up program input data, no entry need be provided. 
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TABLE 5-6 

COVERAGE MODEL PARAMETERS 
COVAGE Arguments 

• ISU - Fault subclass number (1-8) 

• MD - Mode of operation: 

    - 1 for full up 
    - 2 for degenerate 
    - 0 for transitional (from MD = 1 to MD = 2) 

• IET - Fault type code: 

    - 1 for permanent 
    - 2 for transient 

• JS - Number of spare LRU's which must be checked out 
       before recovery can proceed (normally 0 or 1) 

Basic Working Variables 

• Function Pointer Arrays (Common Block CVB1) 

    - NDET(20,8) - Detection rate function numbers 
    - NISO(20,8) - Isolation rate function numbers 
    - NEPR(20,8) - Error propagation recovery function 
                   numbers 
    - NTLR(20,8) - Time lost recovery function numbers 

  where: 

    o 1st subscript is D/I/R mechanism 
    o 2nd subscript is fault subclass number 

• Function Specification Arrays (Common Block CVB2) 

    - FDET(7,200) - Detection function specifications 
    - FISO(7,50)  - Isolation function specifications 
    - FEPR(7,25)  - Error propagation recovery function 
                    specifications 
    - FTLR(7,25)  - Time lost recovery function 
                    specifications 

  where: 

    o 1st subscript is word within a specification 
    o 2nd subscript is function number within the 
      specification array. 
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TABLE 5-6 (Cont.) 

Function Specification Element Descriptors 

- CODES - (Word #1) - contains 5 packed integer codes: 

    o IGFT - Function Model Identifier (specifies a 
             characteristic curve) (0 = impulse, 1 = 
             constant, 2 = pulse train, 3 = decaying 
             exponential) (0 not applicable for 
             recovery) 

    o ISCH - Scheduling code (0 = no, 1 = yes) 
             (detectors only) 

    o IREP - Repetition factor in minor cycles 
             (scheduled detectors only) 

    o IDEF - Function defined flag (0 = no, 1 = yes) 
             (set automatically when specifications 
             are input) 

    o INTF - Integral defined flag (0 = no, 1 = yes) 

- COEF - (Word #2) - Explicit function multiplier 

- TDEL - (Word #3) - Delay before the function becomes 
  non-zero (must be zero for recovery functions) 

- P1, P2, P3 - (Words #4, 5, 6) - Arbitrary parameters 
  passed to the function model for evaluation. Note 
  that if automatic normalization is requested by the 
  user, all function models used by detection or 
  isolation functions must use P1 as the normalizing 
  coefficient. 

- TDUR - (Word #7) - Time during which the function 
  is non-zero (must be zero for impulse functions) 

• Computer System Dependent Variables (Common Blocks 
  CVB0, CVB5 and CVB6) 

- PFDS(8) - Probability of detecting a failure in a 
            spare LRU during initial checkout, for each 
            of 8 fault subclasses 

- TFDS(8) - Time delay incurred in checking out a spare 
            LRU, for each of 8 fault subclasses 
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- LCM - Least common multiple of the n_i's, where n_i 
        is the repetition factor (where applicable) 
        of the ith detector, expressed in terms of 
        minor cycles, and automatically generated 
        by CARE2 

- TMINOR - Duration of a minor cycle, in units 
           compatible with all rate functions 

- IFSC(8) - Reliability model stage in which a 
            fault subclass is located, for each of 8 
            subclasses 

- FRAC(8) - Fraction of total stage (class) faults 
            which occur in a fault subclass, for 
            each of 8 subclasses. (Used in conjunction 
            with IFSC to specify the relationship 
            between subclasses and stages, thus enabling 
            calculation of stage coverage values on the 
            basis of weighted average subclass coverages) 

COVAGE Call Level Intermediate Variables 

• NDSAV(20)  - Detector rate function numbers 

• NISAV(20)  - Isolation rate function numbers 

• NR1SAV(20) - Error propagation recovery function numbers 

• NR2SAV(20) - Time lost recovery function numbers 

• ICHAR(20)  - Characteristic of detection/recovery 
               mechanism: 

    - 0 if mechanism is inoperative 
    - 1 if detector is unscheduled-impulse 
    - 2 if detector is unscheduled-finite 
    - 3 if detector is scheduled-impulse 
    - 4 if detector is scheduled-finite 

• IREPAR(20) - Repetition rate of detector (if applicable) 
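
Two of the derived quantities above, the stage coverage obtained from IFSC and FRAC and the LCM of the detector repetition factors, can be sketched in a few lines of Python. The function names, the division by the summed fractions, and the sample subclass coverages (0.99, 0.98, 0.96) are assumptions made for illustration; only the IFSC and FRAC values are taken from the Job #6 description in Section 6. 

```python
from functools import reduce
from math import gcd

def stage_coverage(ifsc, frac, subclass_cov, stage):
    """FRAC-weighted average of the coverages of subclasses located in `stage`."""
    members = [k for k, s in enumerate(ifsc) if s == stage]
    weight = sum(frac[k] for k in members)
    if weight == 0:
        raise ValueError("no fault subclass is assigned to this stage")
    # Dividing by the summed fractions is an assumption; it has no effect
    # when the fractions of a stage already add up to 1.0.
    return sum(frac[k] * subclass_cov[k] for k in members) / weight

def lcm_of_repetitions(factors):
    """Least common multiple of detector repetition factors (minor cycles)."""
    return reduce(lambda a, b: a * b // gcd(a, b), factors, 1)

ifsc = [2, 3, 3]            # stage in which each subclass is located
frac = [1.0, 0.63, 0.37]    # fraction of stage faults in each subclass
cov = [0.99, 0.98, 0.96]    # hypothetical subclass coverages

c_stage3 = stage_coverage(ifsc, frac, cov, stage=3)
lcm = lcm_of_repetitions([2, 3, 4])
print(round(c_stage3, 4), lcm)
```

Here the third stage's coverage is 0.63*0.98 + 0.37*0.96, and detectors repeating every 2, 3 and 4 minor cycles realign every 12 minor cycles. 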
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TABLE 5-6 (Cont.-3) 

• PERIAR(20) - Period of repetition (if applicable) of 
               detection functions 

• TDELAR(20) - Delay times for detection functions 

• PDIRAR(20) - Non-competitive probability of detection, 
               isolation, and recovery, given that JS 
               spares must be checked before recovery 
               can proceed 

• INDEX - Function of arguments IET and MD as follows: 

    IET   MD   INDEX 

           0     5 
     1     1     1 
     1     2     2 
     2     1     3 
     2     2     4 

• GPFLG - Flag indicating the meaning of array CVGP: 

    when .FALSE., GPAR = g'(i,t) 
    when .TRUE.,  GPAR = G'(i,t) 

D/I/R Evaluation Level Intermediate Variables 

• IMPG (logical) - = .TRUE. when detector is an impulse 

• IMPH (logical) - = .TRUE. when isolator is an impulse 

• GPAR(N) - one row of the GPAR(20,N) array, 
            which is alternately used to hold 
            values of g'(i,t) and G'(i,t) 

• INDEX - (constant for COVAGE call) - function 
          of MD and IET 

• TGFIN  - Last time at which g(t) is non-zero 

• THFIN  - Last time at which h(t') is non-zero 

• TR1FIN - Last time at which r'(t') is non-zero 

• TR2FIN - Last time at which r'(t + t') is non-zero 
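
The INDEX variable tabulated above can be expressed as a small lookup. The following Python sketch reproduces the five tabulated values; the function name is hypothetical, and treating MD = 0 as INDEX 5 regardless of IET is an assumption based on the table showing no IET entry on that row. 

```python
def coverage_index(iet, md):
    """Map fault type code IET and mode MD to the coverage INDEX (1 thru 5)."""
    if md == 0:
        # Transitional case (mode 1 to mode 2); the table lists no IET here.
        return 5
    if iet not in (1, 2) or md not in (1, 2):
        raise ValueError("IET must be 1 or 2, and MD must be 0, 1 or 2")
    # Permanent faults (IET = 1) give 1 or 2; transient (IET = 2) give 3 or 4.
    return 2 * (iet - 1) + md

for iet, md in [(1, 1), (1, 2), (2, 1), (2, 2), (1, 0)]:
    print(iet, md, coverage_index(iet, md))
```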
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5.2 OPERATIONAL LIMITS 

The numerical limits of CARE2 are shown in Table 5-7. Most of 
these are self-explanatory; however, particular attention is directed 
to the last item, entitled numerical integration steps. As noted, 
there are three independent step-size controls, applicable to the 
various integration algorithms, all of which use the Simpson 3-point 
parabolic integration scheme. (This standard numerical integration 
method was chosen because of its affinity for exponential curves.) 

For the dual mode model, the integrations performed in subprograms 
DCT and DCT2 divide the integrand into 3-point Simpson's Rule 
segments, with a base width of STEP time units. Optimal condition 
accuracy tests performed here show minor variation in the 8th decimal 
digit, given a T/STEP ratio of 10 or more. 
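
As an illustration of the segment scheme just described, the following Python sketch integrates a function over [0, T] using 3-point Simpson segments whose base width is STEP, so that the number of sub-intervals is 2*[T/STEP]. It is not the DCT or DCT2 code; the function name and the exponential test integrand are assumptions chosen only to show the mechanics. 

```python
import math

def integrate_by_segments(f, t_end, step):
    """Integrate f over [0, t_end] with 3-point Simpson segments of width `step`."""
    n_seg = max(1, round(t_end / step))   # number of segments, [T/STEP]
    width = t_end / n_seg
    total = 0.0
    for k in range(n_seg):
        left = k * width
        mid = left + width / 2.0
        right = left + width
        # One parabolic segment: (width / 6) * (f_left + 4 * f_mid + f_right)
        total += width / 6.0 * (f(left) + 4.0 * f(mid) + f(right))
    return total

lam = 1.0e-4                              # illustrative failure rate per hour
approx = integrate_by_segments(lambda t: math.exp(-lam * t), 1.0e4, 1.0e3)
exact = (1.0 - math.exp(-lam * 1.0e4)) / lam
print(approx, exact)
```

With T/STEP = 10 the two results agree to several significant digits for this smooth integrand; a sharply varying integrand would call for a smaller STEP. 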

The integrations performed in COVAGE, and its secondary 
subprograms, use a fixed number of 3-point segments controlled by NINT, 
and pre-determine the non-zero portion of the integrands for optimum 
efficiency. Since new function models, of any form, may be added to 
the current set (cf. Section 4.2.4.21), it necessarily falls to 
the user to determine their accuracy. This may be accomplished by 
experimenting with NINT, which may be set at run time via control 
card ID in READIN2. 

The last integration control is variable N in FINTEG (the 
coverage characteristic curve (function) integrator). N may be 
altered by recompiling FINTEG. 

The user is encouraged to provide a corresponding integral 
model for each function model that is added, thereby precluding the 
need for numerical integration (cf. Section 4.2.4.15). Otherwise, 
the corresponding function specifications must be tagged as "integral 
not defined" (INTF=0), and the attendant inaccuracies of an additional 
integration contended with (cf. INTF discussion in Section 4.2.4.9). 
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TABLE 5-7 

OPERATIONAL LIMITATIONS 

Item                                   Upper Limit   Comment 

Stages per equation type 
  - equations 1 thru 6                 1 
  - equation 7                         [illegible; reads "a"] 

Stages per system                      10 

Total Equations of type 
  - 1 thru 6 per system                10 
  - 7 per system                       1 

Fault classes/subclasses 
  - per stage (equations 1,4,5,6)      0 
  - per stage (equations 2,3,7)        8 
  - per system                         8 

Fault types 
  - equations 1 thru 6                 1             (permanent) 
  - equation 7                         2             (permanent and transient) 

Fault rates per stage 
  - equations 1,2,3,4                  2             (on-line and standby) 
  - equations 5,6                      1             (on-line) 
  - equation 7                         3             (on-line, standby and 
                                                     transient) 

Modes 
  - equations 1,2,3,5,6                1             (full-up) 
  - equation 4                         2             (triplex, simplex) 
  - equation 7                         2             (full-up, degenerate) 

Calculated Coverage types 
  - equations 1,4,5,6                  0 
  - equations 2,3                      1             (permanent) 
  - equation 7                         5             (permanent modes 1 and 2, 
                                                     transient modes 1 and 2, 
                                                     transitional) 
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TABLE 5-7 (Cont.) 

Item                                   Upper Limit   Comment 

Detectors per fault subclass 
  - equations 1,4,5,6                  0 
  - equations 2,3,7                    20 

D/I/R mechanisms per system            20 

Function specifications per system 
  - Detection                          200 
  - Isolation                          50 
  - Recovery type E                    25 
  - Recovery type T                    25 

Function models                        4096          (4 currently defined) 

Variable parameters per                3 
function model 

Time steps                             121 

Parameter variations in run-set        16 

Run-sets per batch run                 No limit 

Finite K                               < 10^[illegible]  (higher value, specified as 
                                                     [illegible], calls K = ∞ model) 

Numerical integration steps 
  - Equation 7                         2*[T/STEP] 
  - Coverage model                     100           (may be changed between 
                                                     run-sets) 
  - Function models with               20            (may be increased w/o 
    undefined integrals                              limit by changing data 
                                                     statement) 
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SECTION 6 

EXAMPLES AND TEST CASE DATA 

The test case data contained in this section was submitted 
to Langley Research Center, under separate cover, in compliance with 
Work Statement Paragraph 9.0. The sample problems so outlined were 
executed, first at the Raytheon computer facility and second at the 
Langley computer facility, in order to demonstrate both proper program 
execution and transferability between sites. 

Test cases and data were selected on the combined basis of 
reflecting realistic problem input situations, and exercising both 
new interface code and new subroutines as supplied by Raytheon. In 
addition, the cases were chosen to serve as a tutorial base both for 
new users of CARE/CARE2 and also for those with previous experience 
in original CARE, but unfamiliar with the changes incorporated in the 
revised program. 

Each of the listings is annotated in order to facilitate 
its interpretation. In the interest of brevity, all but the first 
reliability plot have been deleted. 
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6.1 JOB #1 - PRODUCT OF EQUATIONS 4 AND 2 

Specifics: 

• The first stage is a hybrid/simplex configuration 
  consisting of 3 active LRU's and 1 spare LRU. 

• The second stage is a standby replacement configuration 
  consisting of 2 series LRU's, each with a 
  single dedicated spare LRU. 

• Lambda, for both stages, equals 1.0*10^-4 failures 
  per hour per LRU. 

• Mu, for the first stage, equals 1.0*10^-4 failures 
  per hour per LRU initially and then, in sequence, 
  takes on the values 5.0*10^-5 and 0.0, respectively. 

• Mu, for the second stage, is held constant at 0.0. 

• Mission time equals 3*10^4 hours. 

• Time step (for print and plot purposes) equals 
  10^[exponent illegible] hours. 

• Coverage and Voter Reliability are defaulted. 

• Output request is for both tabular reliability printout 
  and plot. 

6.2 JOB #2 - PRODUCT OF EQUATIONS 4 AND 2, WITH CALCULATED 
    COVERAGE 

Specifics : 

• Same as Job #1 except for Coverage values as related 
to permanent faults. 

• Coverage for the second stage is to be calculated. 
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• Two stage 2 detectors are employed, each with the 
following characteristics: 

Scheduled 

Truncated exponential distribution 
95% detection probability 
Run time of 0.01 seconds 
One repetition per second 

First delayed 0.0 seconds, second delayed 0.5 seconds 

• Given detection of the fault, isolation and recovery 
probabilities are unity. 

6.3 JOB #3 - DUAL MODE MODEL (DUAL CHANNEL) WITH PRESET COVERAGE 

Specifics: 

• Dual Channel configuration with two stages. 

• Each stage has a single spare LRU. 

• Spares reassignment, in the event of a single channel 
  loss, is not allowed. 

• Operation in degenerate mode (mode 2) is allowed, and 
  requires one fully operational channel. 

• Lambda, for both stages, is equal to 1*10^-4 failures 
  per hour. 

• Mu, for both stages, is equal to 5.0*10^-5 failures 
  per hour. 

• The transient error rate, for both stages, equals 
  1.1*10^-5 per hour. 

• Category 2 and 3 switch failure rates are 1.0*10^-7 
  and 2.0*10^-8 per hour respectively. 
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• Coverage values for the first stage are unity. 

• Coverage values for the second stage, for full-up/ 
  degenerate operation respectively, are: 

    o Permanent    - 0.999/0.98, with deltas of 
                     0.999/0.998 
    o Transitional - 0.999, with delta of 0.999 
    o Transient    - 0.99/0.95 

• Mission time equals 10^4 hours. 

• Time step equals 10^3 hours. 

• Output request is for both tabular reliability 
  printout and plot. 

6.4 JOB #4 - DUAL MODE MODEL (DUAL CHANNEL) WITH PRESET COVERAGE, 
    MULTIPLE RUNS WITHIN RUNSET, AND SPARES REASSIGNMENT 

Specifics: 

• Same as Job #3 except for run time variation in 
  spares quantity, and allowance of spares reassignment. 

• Parameter selected for variation is the quantity of 
  spare LRU's associated with Stage 1. 

• This quantity takes on the values 0, 1 and 2 for 
  runs 1, 2 and 3 respectively. 

• Reassignment of spares, in the event of single 
  channel loss, is allowed. 
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6.5 JOB #5 - DUAL MODE MODEL (DUAL CHANNEL) WITH CALCULATED 

COVERAGE AND MULTIPLE RUNSETS 

Specifics: 

• Same as Job #3 except for Stage 2 coverage and 
  multiple Stage 1 parameter variations. 

• Lambda/Mu values, for the first stage, are as 
  follows: 

    o Run 1, Runset 1 - 1.0*10^-4/5.0*10^-5 failures 
      per hour. 

    o Run 2, Runset 1 - 1.0*10^-4/1.0*10^-4 failures 
      per hour. 

    o Run 1, Runset 2 - 1.0*10^-3/5.0*10^-4 failures 
      per hour. 

    o Run 2, Runset 2 - 1.0*10^[illegible]/1.0*10^[illegible] failures 
      per hour. 

• Coverage, for Stage 2, is to be calculated based on 
  the presence of Output Compare and CPU Selftest 
  D/I/R mechanisms, using default data values. 

6.6 JOB #6 - DUAL MODE MODEL (HYBRID CHANNEL) WITH CALCULATED 
    COVERAGE AND MULTIPLE RUNSETS 

Specifics: 

• The first stage requires 3 operating LRU's in 
  mode 1, 2 in mode 2, and has 1 spare LRU. 

• The second stage requires 1 operating LRU in both 
  modes, and has 2 spare LRU's. 

• The third stage requires 2 operating LRU's in 
  mode 1, 1 in mode 2, and has 1 spare LRU. 

• Lambda, for stages 1, 2 and 3, equals 1.0*10^-5, 
  2.5*10^-4 and 0.5*10^-4 failures per hour per LRU, 
  respectively. 

• K, for all stages, equals 2.0. 

• Spares reassignment is allowed. 

• Coverage values for the first stage, for mode 1/mode 2 
  operation respectively, are: 

    o Permanent    - 0.998/0.99, with deltas of 
                     0.995/0.99 
    o Transitional - 0.996, with delta of 0.995 
    o Transient    - 0.99/0.97 

• Coverage for the second stage is to be calculated 
  based on the presence of CPU Selftest, Invalid 
  Instruction and CPU Code D/I/R mechanisms, using 
  default data values. 

• Coverage for the third stage is to be calculated 
  assuming 2 fault subclasses, as described in 
  succeeding line items. 

• The first fault subclass accounts for 63% of LRU 
  faults, and employs memory code as the sole D/I/R 
  mechanism. Default data values are to be used. 

• The second fault subclass accounts for 37% of LRU 
  faults, and employs address feedback as the sole 
  D/I/R mechanism. Default data values are to be used 
  only where not over-ridden by the following: 
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o Checkout time, for a spare, equals 0.1 seconds 
o Detection probability equals 0.9996 
o Isolation probability equals 0.99985 and has 
a pulse distribution of length 0.2 seconds 
o Error propagation recovery coefficient equals 
0.96 for mode 2 operation 

• A second run to be performed, in which CPU Selftest 
for Stage 2 is to be executed every 0.4 seconds. 

• Mission time equals 16*10^3 hours. 

• Time step equals 2*10^3 hours. 

• Output request is for tabular reliability data only. 
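
The relationship between Lambda, Mu and K in this job can be checked with a short Python sketch. It assumes that K is the ratio of the on-line failure rate (Lambda) to the standby failure rate (Mu); that definition is not stated in this section, but it is consistent with the LAM and MU values that can be read in the Figure 6-6 listing. The function name is hypothetical. 

```python
import math

def standby_rate(lam, k):
    """Standby failure rate Mu for on-line rate Lambda, assuming K = Lambda / Mu."""
    if k <= 0:
        raise ValueError("K must be positive")
    return lam / k            # K = infinity gives Mu = 0.0 (spares cannot fail)

lam = [1.0e-5, 2.5e-4, 0.5e-4]               # Job #6 Lambda per stage
mu = [standby_rate(x, 2.0) for x in lam]     # K = 2.0 for all stages
print(mu)

# Job #1 varies Mu for a fixed Lambda; the same assumed relation gives the
# corresponding K values of 1 and 2 and the limiting case K = infinity.
print(standby_rate(1.0e-4, 1.0), standby_rate(1.0e-4, 2.0),
      standby_rate(1.0e-4, math.inf))
```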

6.7 JOB #7 - PRODUCT OF EQUATIONS 7, 4 AND 2 

Specifics: 

• Run consists of producing a product reliability, 
  assuming the configurations of Job #3 and Job #1 
  to be in series in a reliability block diagram. 

• All data values are as defined in the respective 
  jobs with the exception of Mission time, which is 
  set equal to 10^4 hours for the combined configuration. 
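
For blocks in series in a reliability block diagram, the combined reliability at each time step is the product of the individual block reliabilities at that step. The Python sketch below shows that product on a common time grid; the two exponential curves are placeholders with hypothetical rates and are not the Job #3 or Job #1 results, and the function name is an assumption. 

```python
import math

def series_product(*curves):
    """Multiply reliability curves tabulated on the same time grid, point by point."""
    lengths = {len(c) for c in curves}
    if len(lengths) != 1:
        raise ValueError("all curves must be tabulated on the same time grid")
    return [math.prod(values) for values in zip(*curves)]

# Time grid: 0 to 10^4 hours in steps of 10^3 hours (11 points).
times = [k * 1.0e3 for k in range(11)]

# Placeholder block reliabilities (hypothetical rates, for illustration only).
r_a = [math.exp(-1.0e-5 * t) for t in times]
r_b = [math.exp(-2.0e-5 * t) for t in times]

r_sys = series_product(r_a, r_b)
print(r_sys[0], r_sys[-1])
```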


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FIGURE 6-1 

JOB #1 INPUT/OUTPUT LISTINGS 
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RUN 2 RELIABILITY RESULTS FIGURE 6-1 (Cont.-8) 
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FIGURE 6-2 

JOB #2 INPUT/OUTPUT LISTINGS 
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ID 
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RUN 1 RELIABILITY RESULTS FIGURE 6-2 (Cont.-6) 
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CARE2 (COMPUTER-AIDED RELIABILITY ESTIMATION) 
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FIGURE 6-3 (Cont.-3) 
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FIGURE 6-4 

JOB #4 INPUT/OUTPUT LISTINGS 
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FIGURE 6-5 

JOB #5 INPUT/OUTPUT LISTINGS 

$OPTSON OUTPT1=T, OUTPT3=T, PRODT=T, LSTCH=T, COVPRC=T, STGOUT=T$ 
$VAR1 PROD(1)=7,7$ 
$VAR T=1.0E4, STEP=1.0E3, OPTION=1$ 
$COVCAL COVINT=T, IGENC(2)=1, IGENP(2)=1, IFSC(2)=2, FRAC(2)=1.0$ 
IB 
$PARVEC Q1(1)=2,2, Q2(1)=1,1, S(1)=1,1, 
 LAM(1)=2*1.0E-4, MU(1)=2*5.0E-5, GMP(1)=2*1.1E-5$ 
ID 
$DATA SLH2=1.0E-7, SLH3=2.0E-8, RSGN=F, TMINOR=1000.0$ 
IV 4 
$VARY PARAM(1,1)=5.0E-5,1.0E-4$ 
F  D   9  310     1.0   0.0   .091011   .09   0.0   50.0 
F  D  10  311  1  .95   0.0   .72179    .7    0.0    5.0 
F  I   1   01     1.0   0.5 
F  I   2   11     .99   0.0   .04                   25.0 
F  I   3   01     0.8   0.0 

FIGURE 6-5 (Cont.-2) 
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CARE2 (COMPUTER-AIDED RELIABILITY ESTIMATION) FIGURE 6-5 (Cont.-3) 
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NUMBER OF STAGES        2 

CHANNEL FAILURE RATE    .00000010 

SYSTEM FAILURE RATE     .00000002 

CHANNEL FAILURE COV.    1.00000000 
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RUN 1 RELIABILITY RESULTS FIGURE 6-5 (Cont.-7) 
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FIGURE 6-6 

JOB #6 INPUT/OUTPUT LISTINGS 

$OPTSON OUTPT1=T, OUTPT3=T, PRODT=T, COVPRC=T, STGOUT=T$ 
$VAR1 PROD(1)=7,7,7$ 
$VAR T=1.6E4, STEP=2.0E3, OPTION=[illegible]$ 
$COVCAL COVINT=T, IGENC(2)=[illegible], IGENP(2)=[illegible], 
 IFSC(2)=2,3,3, FRAC(2)=1.0,0.63,0.37$ 
IB 
$PARVEC Q1(1)=3,1,2, Q2(1)=2,1,1, S(1)=1,2,1, LAM(1)=1.0E-5,2.5E-4,5.0E-5, 
 MU(1)=5.0E-6,1.25E-4,2.5E-5, C1(1)=.998, C2(1)=.99, CD1(1)=.995, 
 CD2(1)=.99, CTR(1)=.996, CDTR(1)=.995, PRC1(1)=.99, PRC2(1)=.97$ 
ID 
$DATA SLH2=0.0, SLH3=0.0, RSGN=T, TFDS([illegible])=100., TMINOR=1000.0$ 
F  D   1  010     [illegible]   0.0 
F  D   2  010     .25           0.0 
F  D   5  010     [illegible]   0.0 
F  D  10  311  1  .95           0.0   .72179   0.7   0.0   5.0 
F  D  20   01     .9996         0.0 
F  I   1   01     1.0           0.5 
F  I   3   01     0.8          20.0 
F  I   4   11     .99985        0.0   .005               200.0 

FIGURE 6-6 (Cont.-2) 
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COVERAGE FUNCTION SPECIFICATIONS FIGURE 6-6 (Cont.-5) 
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FIGURE 6-7 

JOB #7 INPUT/OUTPUT LISTINGS 

$OPTSON OUTPT1=T,OUTPT3=T,LPLOT=T,PRODT=T,LSTCH=T,DEFCHNG=T,STGOUT=T$ 
$VAR1 PROD(1)=7,7,4,2$ 
$VAR T=1.0E4, STEP=1.0E3, OPTION=1$ 
$DEFAULT LAMD[illegible]=1.0E-4$ 
10000 
100000 
IB 
$PARVEC Q1(1)=[illegible], Q2(1)=1,1, S(1)=4*1, Z([illegible])=2, 
 MU(1)=2*5.0E-5, 1.0E-4, 0.0, GMP(1)=2*1.1E-5, 
 C1(2)=.999, CD1(2)=.999, C2(2)=.98, CD2(2)=.998, CTR(2)=.999, CDTR(2)=.999, 
 PRC1(2)=.99, PRC2(2)=.95$ 
ID 
$DATA SLH2=1.0E-7, SLH3=2.0E-8, RSGN=F$ 
IV 4 
$VARY PARAM(1,3)=1.0E-4,5.0E-5,0.0, PARAM(1,[illegible])=0.0$ 
G 
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DUAL MODE SYSTEM PARAMETERS 
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RUN 2 RELIABILITY RESULTS FIGURE 6-7 (Cont.-6) 
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SECTION 7 

SUMMARY AND CONCLUSIONS 


The mathematical models and computer program described in 
this report are viewed by the authors as but an early step in the 
development of adequate design evaluation aids for fault tolerant 
computers. 

The tool provided is by no means a panacea. For example, 
the requirement for user determination of individual fault detection, 
isolation and recovery process characteristics is viewed as a rather 
significant drawback. There are others as well. 

Nevertheless, we also view it as a major improvement over 
other available reliability evaluation aids. For example, the 
provision for two distinct operating modes contrasts favorably with the 
single mode approximations typically encountered. Similarly, the 
inclusion of a means for calculating multiple coverage factors, 
conditioned on failure type and location, operating mode, and spares 
status, goes far beyond the more usual mathematical reliability model. 

In addition, the generality of the model enables its use on 
a wide variety of system configurations, and variations therein. These 
include TMR, TMR with spares switching, hybrid, and a majority of 
others as well. 

There remains much to be done, however, prior to the 
transformation of fault tolerant computer design from a rather 
mystical art into a science. Most noteworthy, perhaps, is 
the task of gathering adequate failure rate statistics within 
integrated circuit chips, and also the task of establishing and 
implementing measurement criteria for individual fault detection, isolation 
and recovery processes. Hopefully, these will prove to be forthcoming 
from other endeavors. 
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THE CONTROL CODE IN COL.S 1-i WILL DETERMINE WHICH 
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FOR THE JTH TIME STEP, COMPUTE THE PROBABILITY 

THAT THE DUAL MODE SYSTEM WILL HAVE SURVIVED 

AND HAVE DEGRADED DUE TO A CATEGORY 2 SWITCH FAILURE 
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